remote: locally index list of checksums available on cloud remotes #3634
INDEX_TABLE = "remote_index"
INDEX_TABLE_LAYOUT = "checksum TEXT PRIMARY KEY, " "dir INTEGER NOT NULL"
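For readers following the diff, here is a minimal sketch of how a table with this layout could be created and queried using Python's standard `sqlite3` module. Only `INDEX_TABLE` and `INDEX_TABLE_LAYOUT` come from the diff; the helper names `open_index`, `add_checksums`, and `indexed` are hypothetical and are not taken from the PR.

```python
import sqlite3

# These two names are copied from the diff above.
INDEX_TABLE = "remote_index"
INDEX_TABLE_LAYOUT = "checksum TEXT PRIMARY KEY, " "dir INTEGER NOT NULL"


def open_index(path=":memory:"):
    """Open (or create) an index database holding the single table above."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS {} ({})".format(INDEX_TABLE, INDEX_TABLE_LAYOUT)
    )
    return conn


def add_checksums(conn, checksums, is_dir=False):
    """Insert checksums, silently skipping ones that are already indexed."""
    conn.executemany(
        "INSERT OR IGNORE INTO {} (checksum, dir) VALUES (?, ?)".format(INDEX_TABLE),
        ((checksum, int(is_dir)) for checksum in checksums),
    )
    conn.commit()


def indexed(conn, dirs_only=False):
    """Return the set of indexed checksums, optionally only .dir entries."""
    query = "SELECT checksum FROM {}".format(INDEX_TABLE)
    if dirs_only:
        query += " WHERE dir = 1"
    return {row[0] for row in conn.execute(query)}
```

The `dir` column is stored as an integer flag because SQLite has no dedicated boolean type.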
If we are using a database, would it be better to have a single .dvc/tmp/index db, with appropriate tables/columns for handling multiple remotes, vs having multiple single-table .dvc/tmp/index/... dbs?
I'm not sure how common it is for users to have multiple remotes configured, but in the case where you have multiple large remotes, as it is right now we would be duplicating data across multiple (potentially large) index db files.
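To make the single-database alternative concrete, here is a rough sketch of what one shared index file keyed by remote might look like. The schema, the `remote` column, and all function names are assumptions made for illustration; nothing here is taken from the PR.

```python
import sqlite3

# Hypothetical schema: one database, rows keyed by (remote, checksum).
SCHEMA = """
CREATE TABLE IF NOT EXISTS remote_index (
    remote   TEXT NOT NULL,
    checksum TEXT NOT NULL,
    dir      INTEGER NOT NULL,
    PRIMARY KEY (remote, checksum)
)
"""


def open_shared_index(path=":memory:"):
    """Open (or create) a single index database shared by all remotes."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add(conn, remote, checksum, is_dir=False):
    """Record that `checksum` is known to exist on `remote`."""
    conn.execute(
        "INSERT OR IGNORE INTO remote_index VALUES (?, ?, ?)",
        (remote, checksum, int(is_dir)),
    )


def checksums_for(conn, remote):
    """Return the set of checksums indexed for one remote."""
    rows = conn.execute(
        "SELECT checksum FROM remote_index WHERE remote = ?", (remote,)
    )
    return {row[0] for row in rows}


def clear_remote(conn, remote):
    """Drop the index for one remote without touching the others."""
    conn.execute("DELETE FROM remote_index WHERE remote = ?", (remote,))
```

Note that this layout keeps everything in one file but still stores one row per (remote, checksum) pair. Removing the duplication entirely would need checksums normalised into their own table, which is a further step this sketch does not take.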
@pmrowla If we are talking about storing unpacked *.dir cache files, then there doesn't seem to be a point to store them for each remote, since no matter the remote if it has *.dir present we trust that it also has all cache files from within it. So maybe it is a premature optimization for later general indexes for whole remotes.
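The trust rule described in this comment can be sketched as a small function: once a .dir checksum is confirmed on the remote, every file checksum listed inside it is treated as present too. The function name and the shape of `dir_contents` are hypothetical and chosen only for illustration.

```python
def expand_known(remote_dirs, dir_contents):
    """Return all checksums assumed to be present on the remote.

    remote_dirs:  .dir checksums that were confirmed on the remote.
    dir_contents: mapping of .dir checksum -> checksums of the files it lists.
    """
    known = set(remote_dirs)
    for dir_checksum in remote_dirs:
        # A confirmed .dir implies its contents; no per-file query is needed.
        known.update(dir_contents.get(dir_checksum, ()))
    return known
```

Directories that are not confirmed on the remote contribute nothing, so their files would still have to be checked individually.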
Some benchmarks from my machine (dual-core i7-5557U @ 3.10GHz, osx catalina):
S3 remote, 2M imagenet files, w/existing state database, but no pulled/fetched files
- latest PR branch, with no existing index file (equivalent to first time running
- latest PR branch, with existing index file
- latest master
- 0.91.0 default (pre-optimizations)
- 0.91.0 no-traverse = False
- rclone (w/sanitized remote url)
The traverse improvements from 0.91.0 to now should be more drastic on machines with more cores.
@pmrowla And what about the case when we add 1 (or just n << N) new files to the dataset? I think the index should also shine there, compared to what we had before.
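A rough illustration of why the n << N case should benefit: with an index, only checksums that are not already indexed need to be looked up on the remote. The function below is a hypothetical sketch of that idea, not the PR's code.

```python
def checksums_to_query(local_checksums, indexed_checksums):
    """Split local checksums into those needing a remote query and those skipped.

    Returns (to_query, skipped): `to_query` is the set of checksums missing
    from the index, `skipped` is how many lookups the index saved.
    """
    local = set(local_checksums)
    to_query = local - set(indexed_checksums)
    return to_query, len(local) - len(to_query)
```

For a dataset of N indexed files plus n new ones, `to_query` has n entries, so the remote work scales with n rather than N.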
Just some unnecessary fixture includes
Had discussion with @efiop, a couple of changes needed before this can be merged.
Will also check cprofile results and update benchmarks
Previous benchmarks post has been updated with current master and PR branch numbers for
Running with cProfile, pathinfo handling code is a large bottleneck for us when checking cloud status, particularly in from zip w/cprofile output:
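For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of profiling a single call with the standard-library `cProfile` and `pstats` modules. The wrapper name `profile_call` is hypothetical, and this is not necessarily how the numbers in this thread were collected.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so expensive call chains appear first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()
```

Sorting by cumulative time tends to surface helpers that are cheap per call but invoked very often, which is the usual shape of a path-handling bottleneck.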
Benchmarks for S3 remote, 100k total files in remote cache
- latest PR branch
- latest master
- 0.91.0 default
- 0.91.0 no_traverse = False
- rclone
@pmrowla amazing stuff - ~7 times faster than rclone. I'm curious where we spend 13 seconds in this scenario, though.
@pmrowla is it
Also, out of curiosity - do we actually need
Thinking about it, looks like the current --clear-index simply drops the index for a specific remote. So maybe it would be more logical to create
Looks like we could indeed drop it for now until someone asks. I think it was me who asked for this flag a while ago, but it no longer looks like a great idea to me; I think I was thinking about something like
- do not index files on partial push/upload
- skip unnecessary index check for dir contents if .dir file exists on the remote
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible.
Index changes from #3600.
Related to #2147.
Docs PR: iterative/dvc.org#1163
New behavior
- [...] .dvc/tmp/index/<filename>.idx where filename is the SHA256 digest of the remote URL.
- [...] (status -c/push/pull/fetch) index will be updated to include any new .dir's which are found on the remote (or are pushed successfully to the remote)
Implementation details
- remote.status/remote.cache_exists
- [...] gc -c), clear the entire index.
- [...] push'd since we last queried remote status)
- pull/fetch: If any index related mismatch/error occurs (i.e. trying to download a file that is in our index but not on the remote), clear the entire index.
- push: After push operation, successfully uploaded .dir checksums and file contents checksums for the directory contents are added to the index. Files uploaded during a partially failed push will not be indexed
- gc -c: Re-index will be required after a gc -c operation
Todo
- [...] .dir checksum existence on remote to validate/invalidate index
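The behaviour listed above can be sketched roughly as follows. This is an illustrative toy, not the PR's implementation: `index_path`, `ToyIndex`, and their methods are hypothetical names, the UTF-8 encoding of the URL before hashing is an assumption, and the `validate` rule is only one reading of the partly truncated list above.

```python
import hashlib
import os


def index_path(dvc_dir, remote_url):
    """Build <dvc_dir>/tmp/index/<sha256 of remote URL>.idx."""
    # Assumption: the URL is hashed as UTF-8 bytes.
    name = hashlib.sha256(remote_url.encode("utf-8")).hexdigest()
    return os.path.join(dvc_dir, "tmp", "index", name + ".idx")


class ToyIndex:
    """In-memory stand-in for the on-disk index, used to show the rules."""

    def __init__(self):
        self.dirs = set()   # .dir checksums believed to be on the remote
        self.files = set()  # file checksums contained in those directories

    def clear(self):
        """Drop everything, forcing a re-index on the next remote operation."""
        self.dirs.clear()
        self.files.clear()

    def update_after_push(self, dir_checksum, file_checksums, failed):
        """Index a pushed directory only if none of its uploads failed."""
        if failed:
            return False  # partially failed push: leave the index untouched
        self.dirs.add(dir_checksum)
        self.files.update(file_checksums)
        return True

    def validate(self, dirs_on_remote):
        """Clear the whole index if an indexed .dir is missing remotely."""
        if not self.dirs <= set(dirs_on_remote):
            self.clear()
            return False
        return True
```

Clearing the whole index on any mismatch trades some re-indexing time for simplicity: there is no attempt to work out which entries are still valid.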