Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

status: add --dvc-only & --outs-only flags #5895

Closed
mattseddon opened this issue Apr 29, 2021 · 13 comments
Closed

status: add --dvc-only & --outs-only flags #5895

mattseddon opened this issue Apr 29, 2021 · 13 comments
Assignees
Labels
discussion requires active participation to reach a conclusion product: VSCode Integration with VSCode extension

Comments

@mattseddon
Copy link
Member

My current understand is that the status command is focused on pipelines. In the VS Code project we currently only need information on the dvc tracked files (for use in our custom SCM view and file decoration).

Ideally, we would like to have a way to limit the output to changed outs which are tracked by DVC.

My suggestion for doing this would be to add two flags --dvc-only and --outs-only.

It would also be good to change the shape of the output (in this instance) to return either a list of dicts. I.e:

[
        {
		"path": "/some/tracked/dir",
		"status": "modified"
	},
	{
		"path": "/some/tracked/file",
		"status": "deleted"
	},
        ...
]

or one giant dict:

{
	"/some/tracked/dir": "modified",
	"/some/tracked/file": "Deleted"
        ...
}

Each have their own benefits, happy to discuss and / or work around what you guys think is best.

Also, I am aware that adding flags and changing the shape of status might not be the right answer and in fact we need a new command. Very much open to that as well. Just wanted to get a discussion started.

Please reach out to me if you need any clarification at all.

Thank you

@mattseddon mattseddon changed the title status: add --dvc-only --outs-only flags status: add --dvc-only --outs-only flags Apr 29, 2021
@mattseddon mattseddon changed the title status: add --dvc-only --outs-only flags status: add --dvc-only --outs-only flags Apr 29, 2021
@mattseddon mattseddon changed the title status: add --dvc-only --outs-only flags status: add --dvc-only --outs-only flags Apr 29, 2021
@mattseddon mattseddon changed the title status: add --dvc-only --outs-only flags status: add --dvc-only --outs-only flags Apr 29, 2021
@mattseddon mattseddon changed the title status: add --dvc-only --outs-only flags status: add --dvc-only & --outs-only flags Apr 29, 2021
@dberenbaum dberenbaum added the product: VSCode Integration with VSCode extension label Apr 29, 2021
@dberenbaum
Copy link
Collaborator

What about dvc diff?

@mattseddon
Copy link
Member Author

Hey @dberenbaum,

I've checked out diff and it seems to provide what we need but (and this is a big but) the performance of diff is even worse than that of the list. That is true even if I specify --targets.

It also provides more granular support for directories which we have decided to not pursue right now.

Do you think the performance of diff could be considerably improved? Could we also add a flag to not descend into directories?

Thanks,

@dberenbaum
Copy link
Collaborator

I think your initial instinct was probably right that status is a better starting point if you aren't interested in the granular directories.

It seem like there are two ways in which dvc status fails to meet your needs for now:

  1. The format of the output isn't great because it includes stage/dependency info you don't need and is organized by stages instead of files.
  2. It's slow.

Are both of these correct, and is one a bigger issue than the other?

@mattseddon
Copy link
Member Author

That is correct. Both points are equally important and interrelated.

What I mean by this is that as the return format is not what we need we have to post-process which makes the performance worse.

Hope this helps, let me know if you need anything further.

As we move forwards I'm going to try and push as much "data transformation" as possible upstream and back into the cli. Would it be helpful if I first ask the question in a ticket like this and then start to make contributions? Or would you rather that the core team focuses on making these changes?

Could be a question for @shcheklein as well but LMK what you think.

@dberenbaum
Copy link
Collaborator

Yeah, we might need to step back and think about the approach here because unlike #5712 , it seem like this (and other issues like #5881) no longer really generate value for regular users of dvc (and may actually add confusion by introducing extra flags/commands).

Would it be possible and make sense to use Python to directly access the dvc api? It wouldn't take much to get what you're asking for here if it doesn't need to be tied to a CLI output:

>>> import collections, itertools
>>> from dvc.api import Repo
>>> outs = itertools.chain.from_iterable(stage.outs for stage in Repo().stages)
>>> status_list = [out.status() for out in outs if out.status()]
>>> status_list
[{'data/features': 'modified'}, {'model.pkl': 'modified'}, {'data/data.xml': 'deleted'}]
>>> status_dict = dict(collections.ChainMap(*status_list))
>>> status_dict
{'data/data.xml': 'deleted', 'model.pkl': 'modified', 'data/features': 'modified'}

@shcheklein
Copy link
Member

Would it be possible and make sense to use Python to directly access the dvc api?

No, we can't use DVC API directly (it's JS to Python). And since there is no DVC API yet, it would make the extension depend on the implementation details? Probably, not the best thing to have when (unlike Studio) we don't have that much of a control to fix it quick.

no longer really generate value for regular users of dvc

Fair enough, how about providing additional info in the JSON response so that we can distinguish DVC-tracked and Git-tracked outputs? That should be enough to start, @mattseddon ? Also, status already accepts targets, so the only question is - what does it return now and is it optimized to avoid collecting the whole DAG for example?

Besides, of course, reviewing the whole output one more time before the release - since it's the first time we'll start to depend on it.

@mattseddon
Copy link
Member Author

No, we can't use DVC API directly (it's JS to Python). And since there is no DVC API yet, it would make the extension depend on the implementation details? Probably, not the best thing to have when (unlike Studio) we don't have that much of a control to fix it quick.

I also don't think the extension would be the right place to put this implementation.

Fair enough, how about providing additional info in the JSON response so that we can distinguish DVC-tracked and Git-tracked outputs? That should be enough to start, @mattseddon ? Also, status already accepts targets, so the only question is - what does it return now and is it optimized to avoid collecting the whole DAG for example?

Besides, of course, reviewing the whole output one more time before the release - since it's the first time we'll start to depend on it.

Any improvement is good and I'm happy to take small steps in the right direction. However, there is one other thing that I ran into this week that lends itself to use needing to take a different approach, I mentioned it on a PR but it's more applicable to bring it up here 👇🏻

Should we be concerned that status does not show Added. Here is a recording of me creating a new folder with two files, running add on the folder and the status not showing up in status (it does however show up in diff):

Screen.Recording.2021-05-03.at.11.55.49.am.mov

I see these as the current options for how to proceed (long term):

  1. Optimise diff (IMO will help users).
  2. Significant changes to status (IMO won't help users).
  3. Implement a new function that is just for the extension (IMO won't help cli users - at least not at first).

Would be good to chat through what you guys think, definitely something to talk about in planning @shcheklein

@dberenbaum dberenbaum added the discussion requires active participation to reach a conclusion label May 4, 2021
@dberenbaum
Copy link
Collaborator

Also, status already accepts targets, so the only question is - what does it return now and is it optimized to avoid collecting the whole DAG for example?

Yes, it looks to me like it is optimized to collect info for a specific target only. @skshetry can confirm.

Does that make the other questions irrelevant for now? I'm lost on whether dvc list --dvc-only (to get dvc-tracked outputs) combined with dvc status [target](to get changes) might be sufficient for now, or whether it's immediately necessary to have dvc status handle both showing what has changed and which changes are dvc-tracked outputs. Of course, I understand this is still a WIP and needs may evolve.

@shcheklein
Copy link
Member

running add on the folder and the status not showing up in status

It's expected from the DVC perspective. It's a really good questions to show those (dvc diff semantics) or not (dvc status semantics) in VS Code. But let's discuss this in the VS Code meetings/channels.

combined with dvc status [target](to get changes) might be sufficient for now

it might be enough to try for now, I think. @mattseddon can correct me if I'm wrong here.

@dberenbaum
Copy link
Collaborator

Sorry, @mattseddon, I somehow missed your comment. As @shcheklein said, that's a great point.

It's probably not obvious enough that diff and status are comparing different things. By default, diff compares the workspace to the last commit, and status compares the workspace to the cache. Here are the high-level explanations from the docs:

diff
Show added, modified, or deleted DVC-tracked files and directories between two commits in the DVC repository, or between a commit and the workspace.

status
Show changes in the project pipelines, as well as file mismatches either between the cache and workspace, or between the cache and remote storage.

If the DVC SCM view is about showing which files have changes that have not yet been tracked/cached by DVC, then I think you want status over diff. diff is more focused on changes tracked by Git.

@dberenbaum
Copy link
Collaborator

@mattseddon No rush since I know you are still probably working through the design, but whenever you are confident that you do or don't need this feature, let us know so we can either prioritize or close this.

@mattseddon
Copy link
Member Author

Thanks @dberenbaum. I have been working in the experiments space this week and we haven't done any further review on the trees but will let you know (or close) when we get back to it.

@dberenbaum dberenbaum self-assigned this May 18, 2021
@mattseddon
Copy link
Member Author

@dberenbaum based on iterative/vscode-dvc#318 (comment) I don't think that this is either a priority or even a want for the vs code extension at the moment. If anything we would probably want to make changes to diff. I'll get back to you with further requirements once the product discussions have finished 👍🏻 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion product: VSCode Integration with VSCode extension
Projects
None yet
Development

No branches or pull requests

3 participants