index: add dvc_data.index.view() and DataIndexView #195
Codecov Report
Base: 51.31% // Head: 53.03% // Increases project coverage by +1.72%
Additional details and impacted files
@@ Coverage Diff @@
## main #195 +/- ##
==========================================
+ Coverage 51.31% 53.03% +1.72%
==========================================
Files 46 47 +1
Lines 2594 2719 +125
Branches 442 465 +23
==========================================
+ Hits 1331 1442 +111
- Misses 1198 1211 +13
- Partials 65 66 +1
For reference, in my not very scientific test for
I would have expected the worst-case scenario to give roughly equivalent performance, so I'll have to look into that some more. Using the entire
import timeit

from dvc.repo import Repo
from dvc_data.index import view

repo = Repo(".")
index = repo.index.data["repo"]
print("loading index")
index.load()

# Views are built once up front so the *_iter functions time iteration only.
index_view = view(index, lambda k: True)  # worst case: every key matches
large_view = view(index, lambda k: "large" in k)
mnist_view = view(index, lambda k: "mnist" in k)

def orig_iter():
    for k, v in index.iteritems():
        pass

def make_view():
    view(index, lambda k: True)

def view_iter():
    for k, v in index_view.iteritems():
        pass

def make_large():
    view(index, lambda k: "large" in k)

def large_iter():
    for k, v in large_view.iteritems():
        pass

def make_mnist():
    view(index, lambda k: "mnist" in k)

def mnist_iter():
    for k, v in mnist_view.iteritems():
        pass

print("index.iteritems", timeit.timeit("orig_iter()", number=10, globals=globals()))
print("construct view (worst case)", timeit.timeit("make_view()", number=10, globals=globals()))
print("view.iteritems (worst case)", timeit.timeit("view_iter()", number=10, globals=globals()))
print("construct view (large)", timeit.timeit("make_large()", number=10, globals=globals()))
print("view.iteritems (large)", timeit.timeit("large_iter()", number=10, globals=globals()))
# Time make_mnist() here; the name mnist_large() is not defined anywhere in this script.
print("construct view (mnist)", timeit.timeit("make_mnist()", number=10, globals=globals()))
print("view.iteritems (mnist)", timeit.timeit("mnist_iter()", number=10, globals=globals()))
Where you get the performance for iterating over the entire index, constructing + iterating over the worst-case scenario view, and constructing + iterating over only
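A benchmark of this shape can also be driven from a single table of cases, passing callables to timeit instead of statement strings, so each label stays tied to the function it measures. The sketch below is only an illustration: it uses a plain dict as a stand-in for the index, and `make_view`, `consume`, and `cases` are hypothetical names, not part of dvc-data.

```python
import timeit

# Stand-in for an index: a plain dict keyed by path tuples (an assumption,
# not the real DataIndex).
index = {("data", name, str(i)): i for name in ("large", "mnist") for i in range(1000)}


def make_view(pred):
    """Hypothetical 'view': a lazy filter over the dict; nothing is copied."""
    return lambda: ((k, v) for k, v in index.items() if pred(k))


def consume(view_fn):
    """Iterate over every (key, value) pair the view yields."""
    for _ in view_fn():
        pass


# One entry per scenario: label -> key filter.
cases = {
    "worst case": lambda k: True,
    "large": lambda k: "large" in k,
    "mnist": lambda k: "mnist" in k,
}

results = {}
for label, pred in cases.items():
    # Time construction and iteration separately, as in the comment above.
    construct = timeit.timeit(lambda: make_view(pred), number=10)
    view_fn = make_view(pred)
    iterate = timeit.timeit(lambda: consume(view_fn), number=10)
    results[label] = (construct, iterate)
    print(f"{label}: construct={construct:.6f}s iterate={iterate:.6f}s")
```

The timings printed by this sketch say nothing about dvc-data itself; only the structure of the measurement carries over.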
Provides a read-only view into DataIndex using a filtered set of keys. Does not modify or create a new trie, only wraps the existing index._trie object.
Should be usable in any existing dvc-data functions that don't write back to the index (e.g. save, checkout, dvcfs methods).
Implements:
iteritems
traverse
ls
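To illustrate the general idea of a filtered, read-only wrapper, here is a minimal sketch over a plain dict keyed by path tuples. It is not the DataIndexView implementation from this PR: the `FilteredView` class, its constructor, and the simplified `ls` semantics are assumptions made for the example, and the real view wraps a trie rather than a dict.

```python
from typing import Any, Callable, Dict, Iterator, Tuple

Key = Tuple[str, ...]


class FilteredView:
    """Hypothetical read-only view over a dict keyed by path tuples."""

    def __init__(self, data: Dict[Key, Any], filter_fn: Callable[[Key], bool]):
        self._data = data  # wrapped as-is, never copied or mutated
        self._filter_fn = filter_fn

    def iteritems(self) -> Iterator[Tuple[Key, Any]]:
        """Yield only the (key, value) pairs whose key passes the filter."""
        for key, value in self._data.items():
            if self._filter_fn(key):
                yield key, value

    def ls(self, prefix: Key) -> Iterator[Key]:
        """Yield the distinct direct children of prefix that pass the filter."""
        seen = set()
        depth = len(prefix)
        for key, _ in self.iteritems():
            if len(key) > depth and key[:depth] == prefix:
                child = key[: depth + 1]
                if child not in seen:
                    seen.add(child)
                    yield child


data = {
    ("data", "large", "a.bin"): 1,
    ("data", "mnist", "train"): 2,
    ("data", "mnist", "test"): 3,
}
mnist = FilteredView(data, lambda k: "mnist" in k)
mnist_keys = [k for k, _ in mnist.iteritems()]
children = list(mnist.ls(("data",)))
```

Because the view holds a reference rather than a copy, constructing it is cheap and later changes to the underlying mapping show up on the next iteration; the cost of filtering is paid during iteration instead.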