[WIP] remote: locally index list of checksums available on cloud remotes #3600
Regarding the actual index storage format, I looked into using bcolz which was suggested in the remote indexes discussion issue, but it's only available as a source distribution when installing w/pip (no binary wheels for any platform), and trying to build fails on my machine (osx catalina). There are conda binaries available though.
One question: what happens if someone else runs
Right now we will just clear the entire local index the first time we try to pull something from the remote that isn't actually there. Taking advantage of the
So, what about
As is right now, yes it will skip pushing files that are in the local index (and we would not see that someone else's One solution to this would be to at least check for
One thought I had was that if
Discussed this with @efiop: potentially breaking backwards-compatibility and reorganizing our cache structure is something for future discussions. In the meantime we can add a @shcheklein thoughts?
It feels to me that for directories it's totally fine to use I would not worry about performance of checking existences of I would worry for us not saving some data.
It all feels very risky to me. We are playing with data here and need to be conservative.
CI expected to fail pending merge of #3604
- use pytest fixtures
- use sha256 digest
- s/fd/fobj/
- check more specific IO errors
- `used_cache()`/`get_used_cache()` in repo/stage/output now return tuples of (dir_cache, file_cache) for directories rather than one flat/merged cache
- if local index contains directory checksums, they will always be checked on the remote. If an expected .dir checksum is missing, the local index will be invalidated/cleared
- only write changes if index was actually modified since last save/load
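The last two points (validating the index against `.dir` checksums, and only saving when something changed) could look roughly like the sketch below. This is not the PR's code: `LocalIndex`, `validate()` and the `remote_exists` callback are hypothetical names used only for illustration.

```python
class LocalIndex:
    """Hypothetical in-memory stand-in for the local remote index."""

    def __init__(self):
        self.dir_checksums = set()
        self.file_checksums = set()
        self.modified = False  # set whenever the contents change

    def update(self, dir_checksums, file_checksums):
        before = len(self.dir_checksums) + len(self.file_checksums)
        self.dir_checksums |= set(dir_checksums)
        self.file_checksums |= set(file_checksums)
        after = len(self.dir_checksums) + len(self.file_checksums)
        if after != before:
            self.modified = True

    def invalidate(self):
        """Drop everything we think we know about the remote."""
        if self.dir_checksums or self.file_checksums:
            self.dir_checksums.clear()
            self.file_checksums.clear()
            self.modified = True

    def save(self, write):
        """Call write(self) only if the index changed; return True if written."""
        if not self.modified:
            return False
        write(self)
        self.modified = False
        return True


def validate(index, remote_exists):
    """Clear the index if any indexed .dir checksum is missing on the remote.

    remote_exists is an assumed callable: checksum -> bool.
    """
    missing = [c for c in index.dir_checksums if not remote_exists(c)]
    if missing:
        index.invalidate()
    return not missing
```

The conservative choice here is to clear the whole index on any missing `.dir` checksum rather than try to repair it, which matches the "invalidated/cleared" wording above.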
dvc/command/data_sync.py
    "--drop-index",
    action="store_true",
    default=False,
    help="Drop local index for the specified remote cache.",
same here - let's not use cache
@jorgeorpinel could you please review the messages, btw?
Agree. We prefer "remote storage" or just "remote" as of now.
dvc/command/data_sync.py
    "--drop-index",
    action="store_true",
    default=False,
    help="Drop local index for the specified remote cache.",
.... and here
- format contains simple header w/checksum counts
- checksums are packed as 4-bytes and then compressed w/gzip
Current flat file format:
For reading and writing an index with 1M checksums, python timeit reports:
Note regarding the read time: index is stored in memory as python sets,
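Since the actual layout is not shown above, here is only a rough sketch of a flat format in this spirit: a small header with the two checksum counts, followed by a gzip-compressed block of packed digests. The header layout, the 16 raw bytes per MD5 digest, and the use of bare hex strings without a `.dir` suffix are all assumptions made for this example, not the PR's format.

```python
import gzip
import io
import struct

# Assumed header: two little-endian uint32 counts (dirs, files).
HEADER = struct.Struct("<II")
DIGEST_SIZE = 16  # raw size of an MD5 digest


def dump(dir_checksums, file_checksums, fobj):
    """Write hex-encoded checksums to the open binary file object fobj."""
    dirs = sorted(dir_checksums)
    files = sorted(file_checksums)
    payload = b"".join(bytes.fromhex(c) for c in dirs + files)
    fobj.write(HEADER.pack(len(dirs), len(files)))
    fobj.write(gzip.compress(payload))


def load(fobj):
    """Read (dir_checksums, file_checksums) back as two sets of hex strings."""
    n_dirs, n_files = HEADER.unpack(fobj.read(HEADER.size))
    payload = gzip.decompress(fobj.read())
    digests = [
        payload[i:i + DIGEST_SIZE].hex()
        for i in range(0, len(payload), DIGEST_SIZE)
    ]
    if len(digests) != n_dirs + n_files:
        raise ValueError("index header does not match payload size")
    return set(digests[:n_dirs]), set(digests[n_dirs:])


if __name__ == "__main__":
    buf = io.BytesIO()
    dump({"00" * 16}, {"11" * 16, "22" * 16}, buf)
    buf.seek(0)
    print(load(buf))
```

Loading into Python sets, as the note above mentions, keeps membership checks cheap once the file has been read.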
pull_parser.add_argument(
    "--drop-index",
    action="store_true",
    default=False,
    help="Drop local index for the specified remote.",
Not sure about "specified remote" since sync commands don't always receive a -r arg. Maybe just remove the word "specified".
Also, "drop index" may not mean anything to a casual user, should this be a little more descriptive?
Same in the other 2 sync commands.
How about `--clear-index` or `--reset-index`?
jobs=None,
remote=None,
show_checksums=False,
drop_index=False,
Add `drop_index` to docstring Args? Same in other functions/methods.
Some more comments, sorry for mixed review style.
# We can safely save here, as existing corrupted files will
# be removed upon status, while files corrupted during
# download will not be moved from tmp_file
# (see `RemoteBASE.download()`)
self.repo.state.save(cache_file, checksum)
Comment getting long
dvc/remote/index.py
def dump(dir_checksums, file_checksums, fobj, protocol=None):
    """Write specified checksums to the open file object ``fobj``."""
Question: what are double back quotes signaling here? Just curious
Old habit from writing sphinx/rst docstrings, I'll clean them up
dvc/remote/index.py
    protocol = DEFAULT_PROTOCOL
if protocol not in SUPPORTED_PROTOCOLS:
    raise DvcException(
        "unsupported remote index protocol version: {}".format(protocol)
- "unsupported remote index protocol version: {}".format(protocol)
+ "unsupported remote index protocol version: '{}'".format(protocol)
AFAIR we use `'` around most dynamic output. There are a few more instances of this.
Args:
    repo: repo for this remote index.
    name: name for this index. If name is provided, this index will be
        loaded from and saved to ``.dvc/tmp/index/{name}.idx``.
        If name is not provided (i.e. for local remotes), this index will
        be kept in memory but not saved to disk.
Should this go in a constructor docstring instead?
afaik standard python conventions allow constructor args to be documented in either `__init__` or in the class declaration like here. I put them here because that's how it was done in some other dvc classes (see `dvc/cache.py::Cache`)
@pmrowla looks great, but a little bit complicated :) could you please update the description so it has an up-to-date list of the things we do in this PR?
- there are some places in dvc where we already have separated dir/file checksums and some places that we do not, use update()/replace() where we already have them separated, and update_all()/replace_all() when we have to do the filtering ourself in RemoteIndex
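As a rough illustration of the split described in this commit message, the sketch below keeps directory and file checksums in separate sets and offers both kinds of entry point. `RemoteIndexSketch` is a hypothetical class, and filtering on a `.dir` suffix is an assumption about how the two kinds are told apart.

```python
DIR_SUFFIX = ".dir"  # assumed marker for directory checksums


class RemoteIndexSketch:
    """Hypothetical index holding dir and file checksums separately."""

    def __init__(self):
        self._dirs = set()
        self._files = set()

    @staticmethod
    def _split(checksums):
        """Partition a mixed iterable into (dir_checksums, file_checksums)."""
        dirs, files = set(), set()
        for checksum in checksums:
            (dirs if checksum.endswith(DIR_SUFFIX) else files).add(checksum)
        return dirs, files

    def update(self, dir_checksums, file_checksums):
        """Add checksums the caller has already separated."""
        self._dirs.update(dir_checksums)
        self._files.update(file_checksums)

    def update_all(self, checksums):
        """Add a mixed collection; the index does the filtering itself."""
        self.update(*self._split(checksums))

    def replace(self, dir_checksums, file_checksums):
        """Overwrite the index with already-separated checksums."""
        self._dirs = set(dir_checksums)
        self._files = set(file_checksums)

    def replace_all(self, checksums):
        """Overwrite the index from a mixed collection."""
        self.replace(*self._split(checksums))

    def checksums(self):
        return self._dirs | self._files
```

Callers that already hold separated checksums skip the per-item suffix check, which is the point of having both pairs of methods.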
@shcheklein it's been updated. I removed the compression stuff from the index file format since it's not needed, so the index file read/write code is simplified now.
dvc/remote/base.py
if removed:
    self.index.invalidate()
Thinking about this more, this is unnecessary. `gc -c` will always fetch the full remote listing via `all()` (regardless of what is in our index). So we can treat `gc -c` as a full re-indexing of the remote. After `gc -c` our index should contain the full remote listing minus the unused checksums which were removed.
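In other words, the index after `gc -c` can be computed directly from the listing instead of being invalidated. A minimal sketch of that idea follows; `gc_and_reindex`, `list_remote` and `remove` are hypothetical names standing in for the real remote calls.

```python
def gc_and_reindex(list_remote, remove, used):
    """Remove unused checksums and return the new index contents.

    list_remote: assumed callable returning every checksum on the remote.
    remove: assumed callable deleting one checksum from the remote.
    used: checksums that are still referenced and must be kept.
    """
    all_checksums = set(list_remote())  # full listing, index is ignored
    unused = all_checksums - set(used)
    for checksum in unused:
        remove(checksum)
    # Everything listed minus what was just removed is what remains.
    return all_checksums - unused
```

The returned set would replace whatever the local index held before, so no separate invalidation step is needed.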
After discussion with @efiop, we decided that a flat format (whether binary or json/msgpack/etc) won't be ideal for dealing with large file/checksum counts. And we will also ideally want random access support for cases like running push/pull with a small number of files. A portable database format like sqlite may work for our needs. Will do some testing to see what kind of runtime performance and index file size we would get with sqlite.
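To make the random-access point concrete, here is a small sketch of what an sqlite-backed index might look like using the standard library `sqlite3` module. The `SqliteIndex` class, the table layout, and the `.dir` suffix check are assumptions for illustration, not the design that was eventually tested.

```python
import sqlite3


class SqliteIndex:
    """Hypothetical checksum index backed by a single sqlite table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checksums ("
            "checksum TEXT PRIMARY KEY, is_dir INTEGER NOT NULL)"
        )

    def update(self, checksums):
        """Add checksums; ones already present are left untouched."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT OR IGNORE INTO checksums VALUES (?, ?)",
                ((c, int(c.endswith(".dir"))) for c in checksums),
            )

    def intersection(self, checksums):
        """Return the given checksums that are in the index.

        Each lookup goes through the primary key, so a push/pull of a few
        files does not need to load the whole index into memory.
        """
        found = set()
        for checksum in checksums:
            row = self.conn.execute(
                "SELECT 1 FROM checksums WHERE checksum = ?", (checksum,)
            ).fetchone()
            if row:
                found.add(checksum)
        return found

    def clear(self):
        """Drop all entries, forcing a re-index on the next operation."""
        with self.conn:
            self.conn.execute("DELETE FROM checksums")
```

Runtime and file size for large checksum counts would still need to be measured, as the comment above says.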
@@ -82,18 +106,20 @@ def _fetch_external(self, repo_url, repo_rev, files, jobs):
if is_dvc_repo:
    repo.cache.local.cache_dir = self.cache.local.cache_dir
    with repo.state:
        cache = NamedCache()
        used_cache = []
Hm, not sure I understand why this is needed.
This is going to be split into multiple PRs, the .dir checksum push/pull/gc behavior changes will come first
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible.
Related to #2147.
New behavior
- `.dvc/tmp/index/<filename>.idx` where `filename` will be the SHA256 digest of the remote URL.
- (`status -c`/`push`/`pull`/`fetch`/`gc -c`) index will be updated to include any new checksums which are found on the remote (or are pushed successfully to the remote)
- `--drop-index` can be used to delete the index for the remote (and force re-indexing)

Implementation details
- `remote.cache_exists()`
- `gc -c`), clear the entire index.
- `pull`/`fetch`: If any index related mismatch/error occurs (i.e. trying to download a file that is in our index but not on the remote), clear the entire index.
- `gc -c`: Queries and re-indexes the entire remote, regardless of what was in the index before running `gc -c`.
- `RemoteIndex` class/module can be easily extended/modified in the future to support any other protocols which support the usual `dump()`/`load()` interface (pickle, json, msgpack, etc)

Todo
- `--drop-index` CLI option to `push/pull/status` so user can force re-indexing
- `.dir` checksum existence on remote to validate/invalidate index
- `push`: push .dir checksums last
- `gc`: remove .dir checksums first

Other ideas:
- `push` behavior for `.dir` checksums to push the directory file contents first, and the actual `.dir` checksum last, we can treat the dir checksum as an index on its own - if the .dir checksum exists on the remote, it follows that the directory contents is also on the remote
- `dvc status -c [target]` #3568)
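For the index location described under "New behavior", a minimal sketch of deriving the file name from the remote URL might look like this. `index_path` is a hypothetical helper, and hashing the UTF-8 encoded URL string as-is (with no normalization) is an assumption.

```python
import hashlib
import os


def index_path(dvc_dir, remote_url):
    """Return <dvc_dir>/tmp/index/<sha256 of remote URL>.idx."""
    name = hashlib.sha256(remote_url.encode("utf-8")).hexdigest()
    return os.path.join(dvc_dir, "tmp", "index", name + ".idx")


if __name__ == "__main__":
    print(index_path(".dvc", "s3://bucket/path"))
```

Keying the file on the URL digest means two remotes with different URLs never share an index, and deleting the file is enough to force re-indexing.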