dvc: provide granularity for commands that could target specific tracked files #2458

Closed
jorgeorpinel opened this issue Sep 2, 2019 · 20 comments · Fixed by #3309

Labels
feature request Requesting a new feature p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Sep 2, 2019

This can be seen as revisiting feature request #1026

UPDATE: Please scroll down to #2458 (comment) for most recent, summarized requirement.
Here is the original context also (still relevant):


There are different scenarios in which being able to manipulate files granularly, independently of how they were committed/pushed to DVC, could be useful. The problem with using dvc add -R now is that it can generate lots of .dvc files. But what if a directory could be added without -R (producing a single DVC-file), and yet other commands (lock, update, get, etc.) could be applied to individual files inside the added directory tree?

Example (from iterative/dataset-registry@7476a85)

Project 1:

$ tree
.
└── tutorial
    └── nlp
        ├── Posts.xml.zip
        └── pipeline.zip
$ dvc add tutorial
...
$ dvc push
...

Project 2:

$ dvc import {project-1-url} tutorial/nlp/pipeline.zip
...
$ tree
.
├── tutorial
│   └── nlp
│       └── pipeline.zip
└── tutorial.dvc

I'm not sure where the .dvc file would have to be placed in this example, though.

This is also how Git works, I believe: files are tracked individually (in fact, it doesn't even recognize empty dirs).

@pared
Contributor

pared commented Sep 2, 2019

@jorgeorpinel so, you propose storing metadata for individual files inside the stage file, instead of storing metadata for the whole directory?

@jorgeorpinel
Contributor Author

jorgeorpinel commented Sep 2, 2019

I didn't think about the implementation details, but that sounds reasonable... unless there are thousands of files in there (which is possible): this could make it hard for Git to handle the DVC-file produced.

I would pull a "how does Git do it?" card here ♠️

@Suor

This comment has been minimized.

@pared
Contributor

pared commented Sep 2, 2019

Well I think that is a bit dangerous ground:

big directory == a lot of metadata entries

which can cause problems when loading it; a big metadata file could also be problematic for Git to handle.
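
To get a rough feel for the size concern above, here is a minimal sketch, not DVC code. It assumes each per-file entry records only an md5 hex digest and a relative path, serialized as compact JSON. The helper names and the generated file names are hypothetical.

```python
import hashlib
import json


def fake_entries(n_files):
    """Build n_files hypothetical per-file entries (checksum + relative path)."""
    entries = []
    for i in range(n_files):
        relpath = f"subdir_{i % 100}/file_{i}.bin"
        # Stand-in for a content hash; real entries would hash file contents.
        md5 = hashlib.md5(relpath.encode()).hexdigest()
        entries.append({"md5": md5, "relpath": relpath})
    return entries


def metadata_size_bytes(n_files):
    """Size in bytes of the entries when serialized as compact JSON."""
    payload = json.dumps(fake_entries(n_files), separators=(",", ":"))
    return len(payload.encode())


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(n, "files ->", metadata_size_bytes(n), "bytes of metadata")
```

Under these assumptions the metadata grows linearly with the number of files, at a minimum of 32 bytes per entry for the digest alone, which is why a directory with very many files would produce a large text file for Git to track.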

@shcheklein

This comment has been minimized.

@efiop
Contributor

efiop commented Sep 2, 2019

What kind of granularity (other than the already existing -R) are you guys talking about?

@shcheklein

This comment has been minimized.

@jorgeorpinel jorgeorpinel changed the title add: should a -R-like option be the default/only behavior for dirs? add: provide granularity for commands that target files inside tracked dirs? Sep 3, 2019
@jorgeorpinel
Contributor Author

OK, I repurposed this issue to explore providing more granular control of files in added dirs. (No longer related to -R, I think.)

@efiop

What kind of granularity... are you guys talking about?

Partial syncs like Ivan mentioned. Please see the description of this issue for an example.

@dashohoxha

This comment has been minimized.

@jorgeorpinel

This comment has been minimized.

@dashohoxha

This comment has been minimized.

@efiop
Contributor

efiop commented Sep 4, 2019

@shcheklein @jorgeorpinel Thanks for the explanation, guys. Yeah, I remember those users who were asking about checking out or pulling only a particular number of files out of a directory. One of the big concerns at that time was that dvc was pretty slow with directories. That has improved significantly since then (still a pretty long way to go, though), so maybe it is acceptable now. But I can still see cases where someone would only want to pull one file out of the directory, and it seems to me that we could support that for most of the operations; we just need to make a few improvements to our logic.

For example, the first step would be to support using paths instead of dvc files as arguments. E.g. dvc pull dir would be the same as dvc pull dir.dvc. After that, we will need to adjust the commands to understand subpaths, so that dvc pull dir/file would only download the cache for file and check it out.

So to summarize, in terms of the current architecture, granular operations are possible with dvc add dir (without -R), but we need to adjust particular dvc commands to understand granular arguments.
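
The two steps described above can be sketched roughly as follows. This is only an illustration of the idea, not DVC's actual implementation: the OUTPUTS and MANIFEST tables and both function names are hypothetical stand-ins for whatever the stage file and directory cache really hold.

```python
from pathlib import PurePosixPath

# Hypothetical project state: one output "dir" tracked by "dir.dvc",
# plus a manifest of the files inside that output.
OUTPUTS = {"dir": "dir.dvc"}
MANIFEST = {"dir": ["file", "sub/a.txt", "sub/b.txt"]}


def resolve_target(target):
    """Map a command target to (dvc_file, output, subpath).

    Accepts a DVC-file ("dir.dvc"), an output path ("dir"), or a path
    inside an output ("dir/sub"). subpath is None for the whole output.
    """
    for out, dvc_file in OUTPUTS.items():
        # Step 1: "dir" and "dir.dvc" mean the same thing.
        if target in (dvc_file, out):
            return dvc_file, out, None
        # Step 2: a path below the output selects part of it.
        try:
            rel = PurePosixPath(target).relative_to(out)
        except ValueError:
            continue
        return dvc_file, out, str(rel)
    raise ValueError(f"{target!r} is not tracked")


def files_to_fetch(target):
    """List the manifest entries a granular pull of target would need."""
    _, out, subpath = resolve_target(target)
    entries = MANIFEST[out]
    if subpath is None:
        return entries
    sub = PurePosixPath(subpath)
    return [
        e for e in entries
        if PurePosixPath(e) == sub or sub in PurePosixPath(e).parents
    ]
```

With this sketch, files_to_fetch("dir") and files_to_fetch("dir.dvc") both return every entry, while files_to_fetch("dir/file") returns only that one file, so only its cache would need to be downloaded.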

Related #2180

@shcheklein shcheklein added the question I have a question? label Sep 4, 2019
@jorgeorpinel
Contributor Author

jorgeorpinel commented Sep 4, 2019

I like the idea, and I'm glad to hear it doesn't seem too difficult to enable this. Using existing <stage>.dvc files as command targets (as is done now) could also be left as an option (for backward compatibility).

@efiop
Contributor

efiop commented Sep 4, 2019

@jorgeorpinel sure, we will support both. It is also useful to have stage.dvc support when you have more than 1 output in it.

@ghost ghost added the discussion requires active participation to reach a conclusion label Oct 2, 2019
@shcheklein
Member

Related: https://discordapp.com/channels/485586884165107732/485596304961962003/634088447858180108:

    ... 
    I sometimes want to have a peek at the data on the remote without downloading the entire dataset.
   ...

@efiop efiop mentioned this issue Nov 26, 2019
1 task
@efiop efiop added feature request Requesting a new feature p1-important Important, aka current backlog of things to do and removed discussion requires active participation to reach a conclusion question I have a question? labels Nov 26, 2019
@dmpetrov dmpetrov added the product: VSCode Integration with VSCode extension label Dec 17, 2019
@jorgeorpinel
Contributor Author

jorgeorpinel commented Dec 17, 2019

From Dmitry (on Slack):

There are two general scenarios:

    1. (Higher priority) Make dvc pull <file_name> work without specifying the DVC-file (where file_name is tracked) – similar to dvc get <file_name> I guess.
    2. Get files from a tracked data directory, e.g. dvc pull data-dir/file.txt. (Requires 1.)

It should work for any command, but these have a higher priority: pull/push/fetch/get/status.

Further rationale/motivation:
There are quite a lot of requests for this: a guy from ScribbleData asked personally, there was a request on Discord a couple of months ago, and I asked for this recently for the dvc get command.
Also, this problem blocks automation scenarios like CD4ML: you cannot programmatically push only the changed files to another remote, since dvc status returns changed file names, which cannot be used as input to dvc push, which requires DVC-files.
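
The automation scenario above could look roughly like this once dvc push accepts tracked paths as targets, which is the behaviour requested in this issue. It is a minimal sketch: the function name, the example paths, and the remote name are hypothetical, and parsing of dvc status output is deliberately left out.

```python
def push_command(changed_paths, remote):
    """Build a single `dvc push` invocation for the given tracked paths.

    The command is only constructed here, not executed. A caller could
    run it with subprocess.run(cmd, check=True).
    """
    if not changed_paths:
        return None  # nothing changed, nothing to push
    return ["dvc", "push", "--remote", remote, *changed_paths]


if __name__ == "__main__":
    # Hypothetical list of changed files, as a script might collect them.
    changed = ["data-dir/file.txt", "model.pkl"]
    print(push_command(changed, "backup"))
```

Keeping command construction separate from execution makes the script easy to check in CI before it touches any remote.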

@jorgeorpinel jorgeorpinel changed the title add: provide granularity for commands that target files inside tracked dirs? add: provide granularity for commands that could target specific tracked files Dec 17, 2019
@jorgeorpinel jorgeorpinel changed the title add: provide granularity for commands that could target specific tracked files dvc: provide granularity for commands that could target specific tracked files Dec 17, 2019
@efiop efiop self-assigned this Dec 18, 2019
efiop added a commit to efiop/dvc that referenced this issue Dec 28, 2019
Using output path or a subdir/subfile path within an output now works
with `dvc push/pull/fetch/status -c` commands. Other commands don't
support the same logic for now, as there are some questions about what
should commands like `dvc remove` do when given a specific output path.

Example:
    dvc add data
    dvc pull data/subdir # will only pull files within data/subdir

Related to iterative#2458
@efiop efiop removed their assignment Jan 28, 2020
@efiop
Contributor

efiop commented Jan 28, 2020

Next step: support for get and import.

@ghost ghost self-assigned this Jan 31, 2020
@ghost ghost removed their assignment Feb 5, 2020
@ghost ghost self-assigned this Feb 5, 2020
efiop pushed a commit that referenced this issue Feb 17, 2020
* get/import: retrieve files inside directory outs

Close #2458

* fs: move, fspath_py35 -> fspath
@jorgeorpinel
Contributor Author

2 questions!

  1. Does dvc list support this granularity? I.e. does it list directory contents? I think so, but just double-checking.
  2. Did we ever update the docs to explicitly explain granularity in all the affected commands? I don't even remember which commands support this already 🙁

@jorgeorpinel
Contributor Author

UPDATE: Yeah, we have iterative/dvc.org/issues/886. OK, I'll give it some priority.

@efiop
Contributor

efiop commented May 15, 2020

does dvc list support this granularity? I.e. does it list directory contents? I think so but just double checking

Nope :) It will soon though :)

Yeah, that doc is on me, I was planning to tackle it when I have time.
