Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index: add dvc_data.index.view() and DataIndexView #195

Merged
merged 3 commits into from
Oct 7, 2022

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Oct 5, 2022

Provides read-only view into DataIndex using a filtered set of keys. Does not modify or create a new trie, only wraps the existing index._trie object

Should be usable in any existing dvc-data functions that don't write back to the index (i.e. save, checkout, dvcfs methods, etc)

Implements

  • iteritems
  • traverse
  • ls

@codecov-commenter
Copy link

codecov-commenter commented Oct 5, 2022

Codecov Report

Base: 51.31% // Head: 53.03% // Increases project coverage by +1.72% 🎉

Coverage data is based on head (75fa018) compared to base (047860a).
Patch coverage: 88.81% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
+ Coverage   51.31%   53.03%   +1.72%     
==========================================
  Files          46       47       +1     
  Lines        2594     2719     +125     
  Branches      442      465      +23     
==========================================
+ Hits         1331     1442     +111     
- Misses       1198     1211      +13     
- Partials       65       66       +1     
Impacted Files Coverage Δ
src/dvc_data/index/checkout.py 70.37% <ø> (ø)
src/dvc_data/index/index.py 69.64% <82.97%> (+0.30%) ⬆️
src/dvc_data/index/view.py 85.71% <85.71%> (ø)
src/dvc_data/index/save.py 78.43% <88.88%> (+1.83%) ⬆️
src/dvc_data/hashfile/tree.py 60.36% <100.00%> (ø)
src/dvc_data/index/__init__.py 100.00% <100.00%> (ø)
src/dvc_data/index/diff.py 65.71% <100.00%> (ø)
tests/test_index.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

src/dvc_data/index/index.py Outdated Show resolved Hide resolved
@pmrowla pmrowla changed the title index: add index.filter() and DataIndexView [WIP] index: add index.filter() and DataIndexView Oct 5, 2022
@pmrowla pmrowla changed the title [WIP] index: add index.filter() and DataIndexView [WIP] index: add index.view() and DataIndexView Oct 6, 2022
@pmrowla pmrowla changed the title [WIP] index: add index.view() and DataIndexView [WIP] index: add dvc_data.index.view() and DataIndexView Oct 6, 2022
@pmrowla pmrowla changed the title [WIP] index: add dvc_data.index.view() and DataIndexView index: add dvc_data.index.view() and DataIndexView Oct 7, 2022
@efiop efiop merged commit 3313c1a into iterative:main Oct 7, 2022
@pmrowla pmrowla deleted the index-view branch October 7, 2022 10:23
@pmrowla
Copy link
Contributor Author

pmrowla commented Oct 7, 2022

For reference, in my not very scientific test for view.iteritems() performance vs index.iteritems():

  • In the worst case scenario, where the view is mapped to the entire index, performance is significantly worse
  • When the view is smaller than the entire index, performance is significantly better

I would have expected the worst case scenario to give roughly equivalent performance so I'll have to look into that some more.


Using the entire dvc-bench repo and this script:

import timeit

from dvc.repo import Repo
from dvc_data.index import view


repo = Repo(".")
index = repo.index.data["repo"]
print("loading index")
index.load()
index_view = view(index, lambda k: True)
large_view = view(index, lambda k: "large" in k)
mnist_view = view(index, lambda k: "mnist" in k)


def orig_iter():
    for k, v in index.iteritems():
        pass


def make_view():
    view(index, lambda k: True)


def view_iter():
    for k, v in index_view.iteritems():
        pass


def make_large():
    view(index, lambda k: "large" in k)


def large_iter():
    for k, v in large_view.iteritems():
        pass


def make_mnist():
    view(index, lambda k: "mnist" in k)


def mnist_iter():
    for k, v in mnist_view.iteritems():
        pass


print("index.iteritems", timeit.timeit("orig_iter()", number=10, globals=globals()))
print("construct view (worst case)", timeit.timeit("make_view()", number=10, globals=globals()))
print("view.iteritems (worst case)", timeit.timeit("view_iter()", number=10, globals=globals()))
print("construct view (large)", timeit.timeit("make_large()", number=10, globals=globals()))
print("view.iteritems (large)", timeit.timeit("large_iter()", number=10, globals=globals()))
print("construct view (mnist)", timeit.timeit("mnist_large()", number=10, globals=globals()))
print("view.iteritems (mnist)", timeit.timeit("mnist_iter()", number=10, globals=globals()))
loading index
index.iteritems 0.7863161660000006
construct view (worst case) 1.3461258339999986
view.iteritems (worst case) 1.877840208000002
construct view (large) 0.0002096250000001021
view.iteritems (large) 9.167000001752967e-06
construct view (mnist) 0.00015529199999875232
view.iteritems (mnist) 5.417000000562666e-06

Where you get the performance for iterating over the entire index, constructing + iterating over the worst case scenario view, and constructing + iterating over only data/large and data/mnist views. (data/mnist is almost 3/4 of the entire amount of data in the entire index)

@pmrowla pmrowla mentioned this pull request Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants