index: add dvc_data.index.view() and DataIndexView #195

pmrowla · 2022-10-05T08:27:09Z

Provides read-only view into DataIndex using a filtered set of keys. Does not modify or create a new trie, only wraps the existing index._trie object

Should be usable in any existing dvc-data functions that don't write back to the index (i.e. save, checkout, dvcfs methods, etc)

Implements

iteritems
traverse
ls

codecov-commenter · 2022-10-05T08:32:33Z

Codecov Report

Base: 51.31% // Head: 53.03% // Increases project coverage by +1.72% 🎉

Coverage data is based on head (75fa018) compared to base (047860a).
Patch coverage: 88.81% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
+ Coverage   51.31%   53.03%   +1.72%     
==========================================
  Files          46       47       +1     
  Lines        2594     2719     +125     
  Branches      442      465      +23     
==========================================
+ Hits         1331     1442     +111     
- Misses       1198     1211      +13     
- Partials       65       66       +1

Impacted Files	Coverage Δ
src/dvc_data/index/checkout.py	`70.37% <ø> (ø)`
src/dvc_data/index/index.py	`69.64% <82.97%> (+0.30%)`	⬆️
src/dvc_data/index/view.py	`85.71% <85.71%> (ø)`
src/dvc_data/index/save.py	`78.43% <88.88%> (+1.83%)`	⬆️
src/dvc_data/hashfile/tree.py	`60.36% <100.00%> (ø)`
src/dvc_data/index/__init__.py	`100.00% <100.00%> (ø)`
src/dvc_data/index/diff.py	`65.71% <100.00%> (ø)`
tests/test_index.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

src/dvc_data/index/index.py

pmrowla · 2022-10-07T10:36:20Z

For reference, in my not very scientific test for view.iteritems() performance vs index.iteritems():

In the worst case scenario, where the view is mapped to the entire index, performance is significantly worse
When the view is smaller than the entire index, performance is significantly better

I would have expected the worst case scenario to give roughly equivalent performance so I'll have to look into that some more.

Using the entire dvc-bench repo and this script:

import timeit

from dvc.repo import Repo
from dvc_data.index import view


repo = Repo(".")
index = repo.index.data["repo"]
print("loading index")
index.load()
index_view = view(index, lambda k: True)
large_view = view(index, lambda k: "large" in k)
mnist_view = view(index, lambda k: "mnist" in k)


def orig_iter():
    for k, v in index.iteritems():
        pass


def make_view():
    view(index, lambda k: True)


def view_iter():
    for k, v in index_view.iteritems():
        pass


def make_large():
    view(index, lambda k: "large" in k)


def large_iter():
    for k, v in large_view.iteritems():
        pass


def make_mnist():
    view(index, lambda k: "mnist" in k)


def mnist_iter():
    for k, v in mnist_view.iteritems():
        pass


print("index.iteritems", timeit.timeit("orig_iter()", number=10, globals=globals()))
print("construct view (worst case)", timeit.timeit("make_view()", number=10, globals=globals()))
print("view.iteritems (worst case)", timeit.timeit("view_iter()", number=10, globals=globals()))
print("construct view (large)", timeit.timeit("make_large()", number=10, globals=globals()))
print("view.iteritems (large)", timeit.timeit("large_iter()", number=10, globals=globals()))
print("construct view (mnist)", timeit.timeit("mnist_large()", number=10, globals=globals()))
print("view.iteritems (mnist)", timeit.timeit("mnist_iter()", number=10, globals=globals()))

loading index
index.iteritems 0.7863161660000006
construct view (worst case) 1.3461258339999986
view.iteritems (worst case) 1.877840208000002
construct view (large) 0.0002096250000001021
view.iteritems (large) 9.167000001752967e-06
construct view (mnist) 0.00015529199999875232
view.iteritems (mnist) 5.417000000562666e-06

Where you get the performance for iterating over the entire index, constructing + iterating over the worst case scenario view, and constructing + iterating over only data/large and data/mnist views. (data/mnist is almost 3/4 of the entire amount of data in the entire index)

index: separate read-only methods into BaseDataIndex

5d97e90

pmrowla self-assigned this Oct 5, 2022

pmrowla requested a review from efiop October 5, 2022 08:27

pmrowla force-pushed the index-view branch from 822d413 to 94f411e Compare October 5, 2022 08:27

pmrowla commented Oct 5, 2022

View reviewed changes

src/dvc_data/index/index.py Outdated Show resolved Hide resolved

pmrowla changed the title ~~index: add index.filter() and DataIndexView~~ [WIP] index: add index.filter() and DataIndexView Oct 5, 2022

pmrowla force-pushed the index-view branch from 94f411e to 7c9eb28 Compare October 6, 2022 09:12

pmrowla changed the title ~~[WIP] index: add index.filter() and DataIndexView~~ [WIP] index: add index.view() and DataIndexView Oct 6, 2022

pmrowla changed the title ~~[WIP] index: add index.view() and DataIndexView~~ [WIP] index: add dvc_data.index.view() and DataIndexView Oct 6, 2022

pmrowla force-pushed the index-view branch from 7c9eb28 to bf90e6a Compare October 6, 2022 10:58

pmrowla added 2 commits October 6, 2022 20:05

index: add read-only DataIndexView

178bb24

index/hashfile: mypy fixes

75fa018

pmrowla force-pushed the index-view branch from bf90e6a to 75fa018 Compare October 6, 2022 11:05

pmrowla mentioned this pull request Oct 7, 2022

index: add IndexView, brancher: support index iterative/dvc#8407

Merged

2 tasks

pmrowla changed the title ~~[WIP] index: add dvc_data.index.view() and DataIndexView~~ index: add dvc_data.index.view() and DataIndexView Oct 7, 2022

efiop approved these changes Oct 7, 2022

View reviewed changes

efiop merged commit 3313c1a into iterative:main Oct 7, 2022

pmrowla deleted the index-view branch October 7, 2022 10:23

pmrowla mentioned this pull request Dec 8, 2022

index: lazy load views #242

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

index: add dvc_data.index.view() and DataIndexView #195

index: add dvc_data.index.view() and DataIndexView #195

pmrowla commented Oct 5, 2022 •

edited

Loading

codecov-commenter commented Oct 5, 2022 •

edited

Loading

pmrowla commented Oct 7, 2022 •

edited

Loading

index: add dvc_data.index.view() and DataIndexView #195

index: add dvc_data.index.view() and DataIndexView #195

Conversation

pmrowla commented Oct 5, 2022 • edited Loading

codecov-commenter commented Oct 5, 2022 • edited Loading

Codecov Report

pmrowla commented Oct 7, 2022 • edited Loading

pmrowla commented Oct 5, 2022 •

edited

Loading

codecov-commenter commented Oct 5, 2022 •

edited

Loading

pmrowla commented Oct 7, 2022 •

edited

Loading