Introducing cache types: data, metrics and plots, run-cache and per-file #4040

Closed
dmpetrov opened this issue Jun 14, 2020 · 2 comments
Labels
feature request Requesting a new feature p2-medium Medium priority, should be done, but less important question I have a question? research

Comments

@dmpetrov
Member

Today we use a single remote to store all the data (it can be overridden with the --remote option). However, the cache holds different types of information:

  1. Data files
  2. Metrics and plots files
  3. Run-cache files

There might be several reasons (data sensitivity or optimizations) to store these artifacts in different remotes:

  1. Data cannot be moved outside of a company data center (or outside of a country), while it's OK to use clouds (S3) for metrics and run-cache, for performance or to simplify the user experience.
  2. Data needs to be duplicated across several buckets/clouds for redundancy, while this is not a requirement for metrics.
  3. Metrics and run-cache consist of tons of small files, while data consists of large files. Users might need separate storage for optimization reasons.
  4. It might be required in the future, when some remotes need auto-compression (see remote storage auto-compression #1239) or splitting data into blocks (see Split data into blocks #829), while the other files don't need that.

It would be great to introduce "types" of remotes so that each remote can be assigned one or more of the cache types.

Proposal ideas

This is the very first iteration on the subject and I don't have a clear proposal yet. We need to take a look at some analogs and possible solutions. How I see it for now:

$ dvc remote add --default --type data mydata ssh://some/dir
$ dvc remote add --default --type metrics,plots,run-cache mymeta s3://mybucket
$ dvc remote add --default --type data,metrics,plots --redundancy mybackup ssh://backup/dir
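To make the intended routing concrete, here is a minimal Python sketch of how the three proposed remotes could be resolved per cache type. Everything in it is an assumption for illustration: the `Remote` class, the `remotes_for` helper, and the handling of `--type` and `--redundancy` are hypothetical and are not part of DVC's API or CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remote:
    # Hypothetical model of a typed remote; not a DVC class.
    name: str
    url: str
    types: frozenset          # cache types this remote accepts
    redundancy: bool = False  # mirrors the proposed --redundancy flag


# The three remotes from the proposal above.
REMOTES = [
    Remote("mydata", "ssh://some/dir", frozenset({"data"})),
    Remote("mymeta", "s3://mybucket",
           frozenset({"metrics", "plots", "run-cache"})),
    Remote("mybackup", "ssh://backup/dir",
           frozenset({"data", "metrics", "plots"}), redundancy=True),
]


def remotes_for(cache_type, remotes=REMOTES):
    """Return the names of every remote that accepts this cache type."""
    return [r.name for r in remotes if cache_type in r.types]


print(remotes_for("data"))       # ['mydata', 'mybackup']
print(remotes_for("run-cache"))  # ['mymeta']
```

With this model a single push would fan out by type: data goes to `mydata` and to the redundant `mybackup`, while run-cache goes only to `mymeta`.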

Related subjects

Remote per file

This use case can be extended to per-file scenarios. Sometimes a special remote is required for a specific file (one bucket for data sources, another for the data derived by the pipeline). An extreme case is imported data sources. It would be great to store the information about remotes and have a single dvc push use all of them, instead of specifying the --remote option every time.

Change remotes globally

See #2960

External workspaces

It might be also related to #3920

Random ideas

Does it make sense to use the new dvc.yaml for storing all this information or part of these?

@dmpetrov dmpetrov added question I have a question? feature request Requesting a new feature labels Jun 14, 2020
@efiop efiop added p2-medium Medium priority, should be done, but less important product: VSCode Integration with VSCode extension labels Jun 15, 2020
@pared
Contributor

pared commented Aug 18, 2020

Remote per file

Related to #2095

@Suor
Contributor

Suor commented Sep 10, 2020

There is also a use case where we need to separate the data itself across remotes. So on top of --type there might be some glob pattern or something similar.
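A rough sketch of what glob-based selection could look like, assuming an ordered list of pattern-to-remote rules where the first match wins. The rule list, remote names, and `remote_for_path` helper are hypothetical; only `fnmatch.fnmatchcase` is a real standard-library call.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (glob pattern, remote name), checked in order.
RULES = [
    ("data/raw/*", "raw-remote"),
    ("data/*", "derived-remote"),
]


def remote_for_path(path, rules=RULES, default="default-remote"):
    """Return the remote of the first rule whose pattern matches the path."""
    for pattern, remote in rules:
        # fnmatchcase's "*" also matches "/" so nested paths are covered.
        if fnmatchcase(path, pattern):
            return remote
    return default


print(remote_for_path("data/raw/images.tar"))
print(remote_for_path("data/features/train.csv"))
print(remote_for_path("metrics.json"))
```

Rule order matters here: the more specific `data/raw/*` pattern has to come before `data/*`, otherwise it would never be reached.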
