
[WIP] remote: locally index list of checksums available on cloud remotes #3600

Closed
wants to merge 15 commits

Conversation

@pmrowla (Contributor) commented Apr 6, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Related to #2147.

New behavior

  • Checksums available on the remote are indexed (cached) locally in .dvc/tmp/index/<filename>.idx, where filename is the SHA256 digest of the remote URL (see the sketch after this list).
  • On any dvc command which queries remote status (status -c/push/pull/fetch/gc -c), the index will be updated to include any new checksums which are found on the remote (or are pushed successfully to the remote).
  • --drop-index can be used to delete the index for the remote (and force re-indexing).
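
For illustration only, the index path could be derived roughly like this (a sketch of the behavior described above, not the PR's exact code):

import hashlib
import os

def index_path(tmp_dir, remote_url):
    # index filename is the SHA256 digest of the remote URL
    digest = hashlib.sha256(remote_url.encode("utf-8")).hexdigest()
    return os.path.join(tmp_dir, "index", "{}.idx".format(digest))

# e.g. index_path(".dvc/tmp", "s3://bucket/dvc-cache")
#      -> ".dvc/tmp/index/<sha256 hex digest>.idx"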

Implementation details

  • remote.cache_exists()
    1. Validate our index by querying for .dir checksums on the remote. If a .dir checksum is in our index, but not the remote (i.e. someone else has run gc -c), clear the entire index.
    2. Look for checksums in the index - If all of the checksums being queried are already in the local index, the remote will not be queried at all.
    3. Any checksums which are not in the local index will be queried on the remote as usual, and the local index will be updated to include any new remote cache entries. (So doing a full list/traverse of the remote will is the same as (re)indexing the entire remote)
  • pull/fetch: If any index related mismatch/error occurs (i.e. trying to download a file that is in our index but not on the remote), clear the entire index.
  • gc -c: Queries and re-indexes the entire remote, regardless of what was in the index before running gc -c.
  • Index file will be read once at the start of relevant dvc commands, and written to once at most (only if the index was modified during the dvc command)
  • RemoteIndex class/module can be easily extended/modified in the future to support any other protocols which support the usual dump()/load() interface (pickle, json, msgpack, etc)
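
A rough sketch of the index-aware existence check, assuming a simplified in-memory RemoteIndex and a hypothetical remote.list_checksums() helper that returns the subset of the given checksums present on the remote (illustrative only, not the PR's exact code):

class RemoteIndex:
    """Simplified stand-in for the PR's index class (illustrative only)."""

    def __init__(self):
        self.checksums = set()
        self.modified = False

    def update(self, checksums):
        self.checksums.update(checksums)
        self.modified = True

    def clear(self):
        self.checksums.clear()
        self.modified = True

def cache_exists(remote, index, checksums):
    checksums = set(checksums)

    # 1. Validate the index: an indexed .dir checksum missing from the remote
    #    (e.g. after someone else's `gc -c`) invalidates the whole index.
    indexed_dirs = {c for c in index.checksums if c.endswith(".dir")}
    if indexed_dirs and indexed_dirs - remote.list_checksums(indexed_dirs):
        index.clear()

    # 2. Anything already in the index is not queried again.
    missing = checksums - index.checksums

    # 3. Only the remaining checksums hit the remote; hits are indexed.
    if missing:
        index.update(remote.list_checksums(missing))

    return checksums & index.checksums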

Todo

  • Figure out an actual storage solution (currently written in flat binary file format)
  • add --drop-index CLI option to push/pull/status so user can force re-indexing
  • Use .dir checksum existence on remote to validate/invalidate index
  • push: push .dir checksums last
  • gc: remove .dir checksums first
  • unit tests for index module
  • functional tests for new behavior
  • docs

Other ideas:

  • Only index .dir checksums to keep local size to a minimum? This was discussed in perf: remote indexes (#2147): if we modify push behavior for .dir checksums to push the directory file contents first and the actual .dir checksum last, we can treat the .dir checksum as an index on its own - if the .dir checksum exists on the remote, it follows that the directory contents are also on the remote (see the push-ordering sketch after this list).
  • Index file size as well as checksums? (Also related to Add summary to dvc status -c [target] (#3568).)
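
A minimal sketch of the push ordering that idea implies; remote.push() here is a hypothetical stand-in for the real upload call:

def push_checksums(remote, checksums):
    # Push directory contents first and .dir checksums last, so that the
    # presence of a .dir checksum on the remote implies its contents are
    # also complete on the remote.
    files = [c for c in checksums if not c.endswith(".dir")]
    dirs = [c for c in checksums if c.endswith(".dir")]
    for checksum in files + dirs:
        remote.push(checksum)  # hypothetical upload call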

@pmrowla (Contributor, Author) commented Apr 6, 2020

Regarding the actual index storage format, I looked into using bcolz, which was suggested in the remote indexes discussion issue, but it's only available as a source distribution when installing w/pip (no binary wheels for any platform), and trying to build it fails on my machine (macOS Catalina). There are conda binaries available, though.

(Outdated review threads on dvc/repo/__init__.py, dvc/remote/index.py, and dvc/remote/base.py - all resolved)
@shcheklein (Member)

One question: what happens if someone else runs dvc gc? Do we check for this somehow? (I remember we had some ideas about directories at least - using the .dir file as a flag.)

@pmrowla (Contributor, Author) commented Apr 7, 2020

> One question: what happens if someone else runs dvc gc? Do we check for this somehow? (I remember we had some ideas about directories at least - using the .dir file as a flag.)

Right now we will just clear the entire local index the first time we try to pull something from the remote that isn't actually there. Taking advantage of the .dir files in some way is planned, though.

@shcheklein (Member)

So, what about push (and e.g. status) in this case? Will it skip files that it thinks (according to the local index cache) already exist remotely?

@pmrowla (Contributor, Author) commented Apr 7, 2020

As is right now, yes, it will skip pushing files that are in the local index (and we would not see that someone else's gc -c operation had removed them from the remote).

One solution to this would be to at least check for .dir files on the remote no matter what. But in the event that there are enough .dir checksums in our query to make us do the full list/traverse of the remote, we wouldn't actually be improving anything in terms of performance.

@pmrowla (Contributor, Author) commented Apr 7, 2020

One thought I had was that if .dir checksums were stored separately from file checksums in our cache (so that .dir entries were under their own prefix), we would be able to fetch them without having to potentially query the entire remote cache, but that would be a pretty significant design change.

@pmrowla (Contributor, Author) commented Apr 7, 2020

Discussed this with @efiop: potentially breaking backwards compatibility and reorganizing our cache structure is something for future discussions.

In the meantime we can add a --drop-index (or similarly named) CLI option to allow users to force re-indexing. So in the event that the user knows some other team member has run gc -c, the user can run status -c/push/pull --drop-index and force us to re-index the remote cache.

@shcheklein thoughts?

@shcheklein (Member)

It feels to me that for directories it's totally fine to use exists? It's quite an unusual edge case to have 1M directories, for example. It would also mean 1M DVC-files?

I would not worry about the performance of checking the existence of .dir files.

I would worry about us not saving some data.

@shcheklein (Member)

> In the meantime we can add a --drop-index (or similarly named) CLI option to allow users to force re-indexing. So in the event that the user knows some other team member has run gc -c, the user can run status -c/push/pull --drop-index and force us to re-index the remote cache.

It all feels way too risky to me. We are playing with data here and need to be conservative.

@pmrowla (Contributor, Author) commented Apr 7, 2020

CI expected to fail pending merge of #3604

pmrowla added 2 commits April 7, 2020 16:53
- use pytest fixtures
- use sha256 digest
- s/fd/fobj/
- check more specific IO errors
- `used_cache()`/`get_used_cache()` in repo/stage/output now return tuples
  of (dir_cache, file_cache) for directories rather than one flat/merged
  cache
- if local index contains directory checksums, they will always be
  checked on the remote. If an expected .dir checksum is missing, the
  local index will be invalidated/cleared
@efiop added the p1-important (Important, aka current backlog of things to do) and performance (improvement over resource / time consuming tasks) labels Apr 7, 2020
pmrowla added 2 commits April 9, 2020 14:14
- only write changes if index was actually modified since last save/load
"--drop-index",
action="store_true",
default=False,
help="Drop local index for the specified remote cache.",
(Member)

same here - let's not use "cache".. @jorgeorpinel, could you please review the messages, btw?

(Contributor)

Agree. We prefer "remote storage" or just "remote" as of now.

"--drop-index",
action="store_true",
default=False,
help="Drop local index for the specified remote cache.",
(Member)

... and here

- format contains simple header w/checksum counts
- checksums are packed as 4-bytes and then compressed w/gzip
@pmrowla (Contributor, Author) commented Apr 9, 2020

Current flat file format:

  • Basic flat file (with a header) where MD5 checksums are stored as 16-byte binary arrays.
  • .dir suffix is not stored in the index file itself for .dir checksums
    • Suffix will be stripped on writing the index file, and appended on reading the index file
    • The number of dir and file checksums is included in the index file header; the first dir_checksum_count checksums in the file are .dir checksums, and all remaining checksums are regular file checksums.
# (all integer types are little-endian)
#
# Header
# ------
#   protocol_version (32-bit uint): index file protocol version
#   dir_checksum_count (64-bit uint): number of .dir checksums in data section
#   file_checksum_count (64-bit uint): number of file checksums in data section
#   crc - 32-bit CRC32 checksum of entire data section
#
# Data section
# ----------------------
#   array of <dir_checksum_count> 128-bit MD5 .dir checksums
#   array of <file_checksum_count> 128-bit MD5 file checksums

For reading and writing an index with 1M checksums, python timeit reports:

dump(): 0.27s    # write
load(): 0.60s    # read

Note regarding the read time: the index is stored in memory as Python sets, so load() time includes the overhead of reading checksums into sets. When reading/appending into a flat list, load time is under 0.3s.
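
For reference, a minimal sketch of dump()/load() for the format described above (the function names match the PR, but the details here are illustrative only):

import struct
import zlib

# header: protocol_version (u32), dir_count (u64), file_count (u64), crc32 (u32)
HEADER = struct.Struct("<IQQI")

def dump(dir_checksums, file_checksums, fobj, protocol=1):
    # .dir suffix is stripped before packing (load() re-appends it)
    dirs = sorted(c[:-len(".dir")] if c.endswith(".dir") else c
                  for c in dir_checksums)
    files = sorted(file_checksums)
    data = b"".join(bytes.fromhex(c) for c in dirs + files)
    fobj.write(HEADER.pack(protocol, len(dirs), len(files), zlib.crc32(data)))
    fobj.write(data)

def load(fobj):
    protocol, ndirs, nfiles, crc = HEADER.unpack(fobj.read(HEADER.size))
    data = fobj.read(16 * (ndirs + nfiles))
    if zlib.crc32(data) != crc:
        raise ValueError("corrupted index file")
    checksums = [data[i:i + 16].hex() for i in range(0, len(data), 16)]
    dir_checksums = {c + ".dir" for c in checksums[:ndirs]}
    file_checksums = set(checksums[ndirs:])
    return dir_checksums, file_checksums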

@skshetry self-requested a review April 9, 2020 11:27
Comment on lines +163 to +167
pull_parser.add_argument(
"--drop-index",
action="store_true",
default=False,
help="Drop local index for the specified remote.",
(Contributor)

Not sure about "specified remote" since sync commands don't always receive a -r arg. Maybe just remove the word "specified".

Also, "drop index" may not mean anything to a casual user - should this be a little more descriptive?

(Contributor)

Same in the other 2 sync commands.

(Contributor, Author)

How about --clear-index or --reset-index?

jobs=None,
remote=None,
show_checksums=False,
drop_index=False,
(Contributor)

Add drop_index to docstring Args?
Same in other functions/methods.

@jorgeorpinel (Contributor) left a comment

Some more comments, sorry for mixed review style.

Comment on lines +121 to +125
# We can safely save here, as existing corrupted files will
# be removed upon status, while files corrupted during
# download will not be moved from tmp_file
# (see `RemoteBASE.download()`)
self.repo.state.save(cache_file, checksum)
(Contributor)

Comment getting long 😋



def dump(dir_checksums, file_checksums, fobj, protocol=None):
"""Write specified checksums to the open file object ``fobj``."""
(Contributor)

Question: what are the double backquotes signaling here? Just curious.

(Contributor, Author)

Old habit from writing Sphinx/reST docstrings - I'll clean them up.

protocol = DEFAULT_PROTOCOL
if protocol not in SUPPORTED_PROTOCOLS:
raise DvcException(
"unsupported remote index protocol version: {}".format(protocol)
(Contributor)

Suggested change:

-        "unsupported remote index protocol version: {}".format(protocol)
+        "unsupported remote index protocol version: '{}'".format(protocol)

AFAIR we use ' around most dynamic output.
There are a few more instances of this.

Comment on lines +143 to +148
Args:
repo: repo for this remote index.
name: name for this index. If name is provided, this index will be
loaded from and saved to ``.dvc/tmp/index/{name}.idx``.
If name is not provided (i.e. for local remotes), this index will
be kept in memory but not saved to disk.
(Contributor)

Should this go in a constructor docstring instead?

(Contributor, Author)

AFAIK standard Python conventions allow constructor args to be documented either in __init__ or in the class declaration like here. I put them here because that's how it was done in some other DVC classes (see dvc/cache.py::Cache).

@shcheklein (Member)

@pmrowla looks great, but a little bit complicated :)

Could you update the description, please, to get an up-to-date list of the things that we do with this PR?

pmrowla added 2 commits April 10, 2020 13:29
- there are some places in dvc where we already have separated dir/file
  checksums and some places that we do not, use update()/replace() where
  we already have them separated, and update_all()/replace_all() when we
  have to do the filtering ourself in RemoteIndex
@pmrowla (Contributor, Author) commented Apr 10, 2020

> @pmrowla looks great, but a little bit complicated :)
>
> Could you update the description, please, to get an up-to-date list of the things that we do with this PR?

@shcheklein it's been updated.

I removed the compression stuff from the index file format since it's not needed, so the index file read/write code is simplified now.

Comment on lines 768 to 769
if removed:
self.index.invalidate()
(Contributor, Author)

Thinking about this more, this is unnecessary.

gc -c will always fetch the full remote listing via all() (regardless of what is in our index). So we can treat gc -c as a full re-indexing of the remote. After gc -c, our index should contain the full remote listing minus the unused checksums which were removed.

@pmrowla (Contributor, Author) commented Apr 10, 2020

After discussion with @efiop, we decided that a flat format (whether binary or json/msgpack/etc) won't be ideal for dealing with large file/checksum counts. And we will also ideally want random access support for cases like running push/pull with a small number of files.

A portable database format like sqlite may work for our needs. Will do some testing to see what kind of runtime performance and index file size we would get with sqlite.
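
A rough illustration of what a sqlite-backed index might look like (the table and column names here are hypothetical, not from this PR):

import sqlite3

def open_index(path):
    # hypothetical schema: one row per checksum, flagged as directory or file
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checksums ("
        "  checksum TEXT PRIMARY KEY,"
        "  is_dir INTEGER NOT NULL DEFAULT 0"
        ")"
    )
    return conn

def add(conn, checksums, is_dir=False):
    conn.executemany(
        "INSERT OR IGNORE INTO checksums (checksum, is_dir) VALUES (?, ?)",
        ((c, int(is_dir)) for c in checksums),
    )
    conn.commit()

def contains(conn, checksum):
    # random-access lookup without loading the whole index into memory
    cur = conn.execute(
        "SELECT 1 FROM checksums WHERE checksum = ?", (checksum,)
    )
    return cur.fetchone() is not None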

@@ -82,18 +106,20 @@ def _fetch_external(self, repo_url, repo_rev, files, jobs):
if is_dvc_repo:
repo.cache.local.cache_dir = self.cache.local.cache_dir
with repo.state:
cache = NamedCache()
used_cache = []
(Contributor)

Hm, not sure I understand why this is needed.

@pmrowla (Contributor, Author) commented Apr 13, 2020

This is going to be split into multiple PRs; the .dir checksum push/pull/gc behavior changes will come first.

Labels
p1-important (Important, aka current backlog of things to do), performance (improvement over resource / time consuming tasks)

Projects
None yet

4 participants