dvc: provide granularity for commands that could target specific tracked files #2458
Comments
@jorgeorpinel so, you propose storing metadata for particular files inside the stage file, instead of storing metadata for the whole directory?
I didn't think about the implementation details, but that sounds reasonable... unless there are thousands of files in there (which is possible): this could make it hard for Git to handle the DVC-file produced. I would pull a "how does Git do it?" card here.
Well, I think that is somewhat dangerous ground: a big directory means a lot of metadata entries, which can cause problems when loading them, and a big metadata file could also be problematic for Git to handle.
What kind of granularity (other than the already existing `-R`) are you guys talking about?
OK, I repurposed this issue to explore providing more granular control of files in added dirs. (No longer related to
Partial syncs like Ivan mentioned. Please see the description of this issue for an example.
@shcheklein @jorgeorpinel Thanks for the explanation, guys. Yeah, I remember those users who were asking about checking out or pulling only a particular number of files out of a directory. One of the big concerns at that time was that dvc was pretty slow with directories, which has been significantly improved since then (still a pretty long way to go though), so maybe it is acceptable now. But I can still see cases where someone would only want to pull one file out of the directory, and it seems to me that we could support that for most of the operations; we just need to make a few improvements to our logic. For example, the first step would be to support using paths instead of dvc files as arguments. E.g.
So to summarize, in terms of the current architecture, granular operations are possible with
Related #2180
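To make the "paths instead of dvc files as arguments" idea concrete, here is a minimal sketch of how a command might map a user-supplied path back to the tracked output that contains it. The function name `find_owning_output` and the plain list of output paths are hypothetical, used only for illustration; this is not DVC's actual implementation.

```python
from pathlib import PurePosixPath


def find_owning_output(outputs, target):
    """Return the tracked output that equals or contains `target`, else None.

    `outputs` is a hypothetical list of tracked output paths (as strings).
    """
    tgt = PurePosixPath(target)
    for out in outputs:
        out_path = PurePosixPath(out)
        # Match on whole path components, so "data" does not match "datasets/x".
        if tgt == out_path or out_path in tgt.parents:
            return out
    return None


tracked = ["data", "models/model.pkl"]
owner = find_owning_output(tracked, "data/subdir/file.csv")
```

Comparing path components rather than raw string prefixes avoids treating `datasets/x` as if it lived inside `data`.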
I like the idea, and I'm glad to hear it doesn't seem too difficult to enable this. Using existing
@jorgeorpinel sure, we will support both. It is also useful to have
Related: https://discordapp.com/channels/485586884165107732/485596304961962003/634088447858180108:
There are two general scenarios:
Using an output path, or a subdir/subfile path within an output, now works with the `dvc push/pull/fetch/status -c` commands. Other commands don't support the same logic for now, as there are some open questions about what commands like `dvc remove` should do when given a specific output path.
Example:
    dvc add data
    dvc pull data/subdir  # will only pull files within data/subdir
Related to iterative#2458
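The selection described above (pulling only the files under `data/subdir`) can be sketched as a filter over a directory listing. This is only an illustration under stated assumptions: the manifest layout (a list of dicts with `md5` and `relpath` keys) and the function name `select_entries` are hypothetical, not DVC's actual data structures or API.

```python
from pathlib import PurePosixPath


def select_entries(entries, out_path, target):
    """Return the entries of directory output `out_path` that fall under `target`.

    `entries` is a hypothetical manifest: dicts with "md5" and "relpath" keys,
    where "relpath" is relative to `out_path`.
    """
    out = PurePosixPath(out_path)
    tgt = PurePosixPath(target)
    if tgt == out:
        # The whole output was requested.
        return list(entries)
    try:
        rel = tgt.relative_to(out)
    except ValueError:
        # `target` is not inside this output at all.
        return []
    selected = []
    for entry in entries:
        relpath = PurePosixPath(entry["relpath"])
        # Keep an exact file match or anything nested under the requested subdir.
        if relpath == rel or rel in relpath.parents:
            selected.append(entry)
    return selected


manifest = [
    {"md5": "aaa", "relpath": "subdir/a.csv"},
    {"md5": "bbb", "relpath": "subdir/nested/b.csv"},
    {"md5": "ccc", "relpath": "other/c.csv"},
]
subset = select_entries(manifest, "data", "data/subdir")
```

With a filter like this, only the selected entries would need to be transferred, while the directory is still tracked by a single metadata file.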
Next step: support for `get` and `import`.
* get/import: retrieve files inside directory outs
  Close #2458
* fs: move, fspath_py35 -> fspath
2 questions!
UPDATE: Yeah, we have iterative/dvc.org/issues/886. OK, I'll give it some priority.
Nope :) It will soon though :) Yeah, that doc is on me, I was planning to tackle it when I have time.
UPDATE: Please scroll down to #2458 (comment) for the most recent, summarized requirement.
Here is the original context also (still relevant):
There are different scenarios in which being able to manipulate files granularly, independently of how they were committed/pushed to DVC, could be useful. The problem with using `dvc add -R` now is that it can generate lots of `.dvc` files, but what if a directory could be added without `-R` (producing a single DVC-file) and yet other commands (lock, update, get, etc.) could be applied to individual files inside the added directory tree?
Example (from iterative/dataset-registry@7476a85):
Project 1:
Project 2:
And also this is how Git works, I believe. Files are tracked individually (in fact it doesn't even recognize empty dirs).