
perf: remove fs exists check in plots, parallel data collect #8777

Merged 1 commit into main from parallel-plots on Jan 11, 2023

Conversation

shcheklein (Member) commented on Jan 7, 2023

Related #8786

Before:

/Users/ivan/Projects/vscode-dvc-demo/.venv/bin/python -m dvc plots diff 11295f0 9aa0603 e29c9be workspace -o .dvc/tmp/plots --split --json - COMPLETED (53199ms)

After:

/Users/ivan/Projects/vscode-dvc-demo/.venv/bin/python -m dvc plots diff 11295f0 9aa0603 e29c9be workspace -o .dvc/tmp/plots --split --json - COMPLETED (9404ms)

... and not a huge change to maintain, to my mind.

It's still very slow, but I think the remaining time goes into the brancher, path collection, etc., which should also be parallelized, indexed, and cached.
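For context, a minimal sketch of the idea (not DVC's actual code; the filesystem object and helper names are illustrative): read each plot file directly from a thread pool and treat a missing file as "no data", instead of paying a per-path fs.exists() round-trip first.

from concurrent.futures import ThreadPoolExecutor


def read_plot(fs, path):
    # Read the file directly; a missing path surfaces as FileNotFoundError
    # here, so no separate fs.exists() round-trip is needed.
    try:
        with fs.open(path) as fobj:
            return path, fobj.read()
    except FileNotFoundError:
        return path, None


def collect_parallel(fs, paths, max_workers=8):
    # Submit all reads to a thread pool so slow (remote) reads overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(lambda p: read_plot(fs, p), paths))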


Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@shcheklein shcheklein added the performance (improvement over resource / time consuming tasks) and A: plots (related to the plots) labels on Jan 7, 2023
@shcheklein shcheklein requested review from efiop and skshetry January 7, 2023 02:17
@@ -39,13 +38,6 @@ def _collect_paths(
    for fs_path in fs_paths:
        if recursive and fs.isdir(fs_path):
            target_paths.extend(fs.find(fs_path))

        rel = fs.path.relpath(fs_path)
        if not fs.exists(fs_path):
shcheklein (Member Author):
Here it also internally hits remote storage. Could the warnings be moved to the bottom instead? We need to make them compact anyway, and in #7692 we will need to handle this better as well.
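A rough sketch of "moving the warnings to the bottom" under these assumptions (hypothetical names, fsspec-like fs object): missing paths are detected from the read itself and reported once, compactly, at the end.

import logging

logger = logging.getLogger(__name__)


def collect(fs, paths):
    data, missing = {}, []
    for path in paths:
        try:
            with fs.open(path) as fobj:
                data[path] = fobj.read()
        except FileNotFoundError:
            missing.append(path)
    if missing:
        # One compact warning at the end instead of one warning
        # (and potentially one remote round-trip) per path up front.
        logger.warning("could not find: %s", ", ".join(missing))
    return data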

codecov bot commented Jan 7, 2023

Codecov Report

Base: 93.53% // Head: 93.57% // Increases project coverage by +0.03% 🎉

Coverage data is based on head (e09c42f) compared to base (a060049).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8777      +/-   ##
==========================================
+ Coverage   93.53%   93.57%   +0.03%     
==========================================
  Files         457      457              
  Lines       36253    36251       -2     
  Branches     5261     5258       -3     
==========================================
+ Hits        33910    33922      +12     
+ Misses       1836     1824      -12     
+ Partials      507      505       -2     
Impacted Files Coverage Δ
dvc/repo/experiments/show.py 91.78% <ø> (ø)
tests/unit/test_collect.py 100.00% <ø> (ø)
dvc/repo/collect.py 100.00% <100.00%> (ø)
dvc/repo/metrics/show.py 95.40% <100.00%> (ø)
dvc/repo/params/show.py 94.00% <100.00%> (ø)
dvc/repo/plots/__init__.py 87.54% <100.00%> (+0.45%) ⬆️
tests/func/metrics/test_show.py 100.00% <100.00%> (ø)
tests/func/plots/test_show.py 100.00% <100.00%> (ø)
dvc/proc/manager.py 74.73% <0.00%> (-1.06%) ⬇️
dvc/repo/experiments/queue/celery.py 88.05% <0.00%> (+1.86%) ⬆️
... and 1 more


@shcheklein shcheklein marked this pull request as ready for review January 7, 2023 03:38
skshetry (Member) commented on Jan 9, 2023

I haven't seen the difference stated in the description on my machine (Linux). For me it's quite the opposite: it looks like there's some overhead from the ThreadPoolExecutor, and it is a couple of seconds slower (and sometimes 10-15 seconds slower; tested on the vscode-demo repo with no cache). But I guess a lot of that also depends on the network. I'll try to benchmark more tomorrow, but the code overall LGTM. Thanks @shcheklein.

shcheklein (Member Author):

@skshetry what exactly did you try to run? Could you share the commands and the repo?

skshetry (Member):

I cloned the vscode-dvc repo, updated the submodule, and ran the following commands:

$ (cd ../../dvc && git checkout main)
$ rm -rf .dvc/cache
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.44s user 0.10s system 3% cpu 41.534 total

$ rm -rf .dvc/cache
$ time dvc plots show
0.66s user 0.06s system 5% cpu 13.611 total

# with cache
$ time dvc plots show
0.67s user 0.06s system 4% cpu 15.944 total
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.61s user 0.08s system 3% cpu 43.946 total
$ (cd ../../dvc && git checkout parallel-plots)
$ rm -rf .dvc/cache
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.50s user 0.11s system 4% cpu 33.071 total

$ rm -rf .dvc/cache
$ time dvc plots show
0.63s user 0.06s system 8% cpu 7.994 total

# with cache
$ time dvc plots show
0.65s user 0.05s system 11% cpu 6.156 total
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.51s user 0.11s system 8% cpu 18.143 total

I think it was my internet itself that was slow yesterday, so there's a clear speedup.

shcheklein (Member Author):

Thanks @skshetry, I'll migrate to a different ThreadPool and merge this tomorrow.

skshetry (Member):

> Thanks @skshetry, I'll migrate to a different ThreadPool and merge this tomorrow.

We can merge this and follow up as well.

assert callable(data_source)
value.update(data_source())

if len(to_resolve) > 1:
shcheklein (Member Author):
@skshetry do you know what the overhead of creating a ThreadPool is? It feels like it can be up to a second. I put in some optimizations to avoid that... I wonder if we can detect that the files are already in the cache, or that all of them are local, to avoid creating it at all.
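For what it's worth, a quick, illustrative way to measure the pool-creation cost being asked about; note that ThreadPoolExecutor starts its worker threads lazily on submit(), so constructing the pool itself is normally cheap.

import timeit
from concurrent.futures import ThreadPoolExecutor


def create_and_use_pool():
    # Create a pool, run a trivial job per worker, and tear it down.
    with ThreadPoolExecutor(max_workers=8) as executor:
        list(executor.map(lambda x: x, range(8)))


print(timeit.timeit(create_and_use_pool, number=100))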

shcheklein (Member Author):
Disregard this, I'm going to remove all these micro-optimizations; I don't think it's worth doing, tbh.


# Yield must be hidden in closure so that the futures are submitted
# before the first iterator value is required.
def result_iterator(tasks):
shcheklein (Member Author):
@skshetry I had to apply a fix from Executor.map for this; otherwise it was not working in my case.
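For reference, a simplified sketch of the CPython Executor.map pattern referred to here (error handling and future cancellation omitted): the yield lives inside an inner closure, so all futures are submitted before the caller pulls the first result.

from concurrent.futures import ThreadPoolExecutor


def parallel_map(executor, fn, iterable):
    # Futures are submitted eagerly, up front.
    futures = [executor.submit(fn, item) for item in iterable]

    def result_iterator():
        # The yield is hidden in this closure, so the submissions above
        # have already happened by the time the generator is first iterated.
        for fut in futures:
            yield fut.result()

    return result_iterator()


with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(parallel_map(pool, lambda x: x * x, range(5))))  # [0, 1, 4, 9, 16]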

shcheklein (Member Author):
We should probably propagate this change to dvc_objects/dvc_data.

skshetry (Member) commented on Jan 11, 2023:
@shcheklein, let's not change this. You can do something like the following:

with executor:
    list(executor.imap_unordered(resolve, to_resolve))

It is called imap_unordered because it is lazy and does not guarantee ordering.

Also see multiprocessing.pool.Pool.imap and multiprocessing.pool.Pool.imap_unordered.
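For comparison, the stdlib counterparts mentioned above behave like this (ThreadPool used here only to keep the example self-contained): imap is lazy but preserves input order, imap_unordered is lazy and yields results as they complete, and wrapping either in list() forces it to block until everything has finished.

from multiprocessing.pool import ThreadPool


def square(x):
    return x * x


with ThreadPool(processes=4) as pool:
    ordered = list(pool.imap(square, range(5)))                 # [0, 1, 4, 9, 16]
    unordered = sorted(pool.imap_unordered(square, range(5)))   # arrival order varies
    print(ordered, unordered)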

skshetry (Member) commented on Jan 11, 2023:
I am not against a new blocking map API, but that carries an expectation of ordering because of the existing ThreadPoolExecutor.

shcheklein (Member Author):
Okay, agreed. Let's keep it. It still stays somewhat lazy, but I see that we want it to be strictly lazy.

skshetry (Member):
Let's not change it. We want it lazier and stricter, but with weaker guarantees. We could add other APIs too, but that might add more maintenance burden, as we also have some backports here. Also, we need to keep this in sync with other projects.

It is simpler to use list() to make it block.

@shcheklein shcheklein force-pushed the parallel-plots branch 3 times, most recently from 6299300 to 1f1ed6e on January 11, 2023 04:07
@shcheklein shcheklein enabled auto-merge (rebase) January 11, 2023 04:18
@shcheklein shcheklein merged commit d4bfe38 into main Jan 11, 2023
@shcheklein shcheklein deleted the parallel-plots branch January 11, 2023 04:27