Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ls: optimize --dvc-only #5712

Closed
skshetry opened this issue Mar 26, 2021 · 34 comments · Fixed by #7659
Closed

ls: optimize --dvc-only #5712

skshetry opened this issue Mar 26, 2021 · 34 comments · Fixed by #7659
Assignees
Labels
A: status Related to the dvc diff/list/status discussion requires active participation to reach a conclusion optimize Optimizes DVC p2-medium Medium priority, should be done, but less important performance improvement over resource / time consuming tasks product: VSCode Integration with VSCode extension

Comments

@skshetry
Copy link
Member

For dvc ls --dvc-only, we walk through all files and then filter those that are not dvc tracked files and list them. This could be optimized, and changed to just walk through dvc-tracked files from the trie we have.

@skshetry skshetry added performance improvement over resource / time consuming tasks optimize Optimizes DVC labels Mar 26, 2021
@skshetry
Copy link
Member Author

I decided to wait for fsspec changes. As discussed with @efiop, I like the idea of using listdir() to achieve this, which would be implemented with fsspec changes.

cc @isidentical

@skshetry
Copy link
Member Author

skshetry commented Mar 26, 2021

It'd be great if we could clarify what --dvc-only means. Right now, it's not actually clear.

Let's say I have a data directory in modified state, tracked with dvc, having the following entries and their status:

  1. data/foo (tracked and unchanged)
  2. data/bar (tracked but now deleted)
  3. data/foobar (recently added, not tracked, i.e. was not when dvc add was done previously)

When I do dvc list -R . --dvc-only, what should happen? Should I see all of them?
Similarly, what should happen if I do dvc list -R . data --dvc-only?

The following is the current behaviour:

$ dvc list -R . --dvc-only
data/bar
data/foo
data/foobar
$ dvc list . data -R --dvc-only
data/foo
data/foobar

cc @dberenbaum

@dberenbaum
Copy link
Collaborator

Is this a --dvc-only question? I think even without that argument, you will get data/bar in the first command and not in the second. I agree it's a consistency issue. Is it expected that tracked but deleted files ever show up in dvc list?

@mattseddon
Copy link
Member

@skshetry, @dberenbaum,

We are planning to use this functionality for the VS Code extension. We will need the list of tracked files for at least two reasons:

  1. Informing the Explorer view not to grey out these files / directories because they are not 'git ignored' (see Story: Match Git extension decoration for paths tracked by DVC vscode-dvc#207).
  2. Knowing which files / directories to watch for changes so that we can run dvc status at the appropriate times and provide the status information to our SCM view (see Represent dvc status data in VS Code API vscode-dvc#169).

Ideally we would like to see all files / directories that are being tracked by DVC appear whenever running dvc list -R . --dvc-only, we do not currently care about untracked. We would like it to be as quick as possible (can we run a "local only" option).

Do you have a rough timeframe for when the initial optimisation work will happen?

Thanks,
Matt

@dberenbaum dberenbaum added the product: VSCode Integration with VSCode extension label Apr 7, 2021
@dberenbaum
Copy link
Collaborator

Good to know, @mattseddon! We didn't have this prioritized yet, but we can try to do so soon now that we know you need it. If you have any particular timeline in mind or relative priority with other vscode requests, let us know.

@mattseddon
Copy link
Member

Thank you @dberenbaum. Within the next 8 weeks is probably soon enough for us. Will let you know if anything changes.

@mattseddon
Copy link
Member

@dberenbaum got a further requirement for this one in that we(vscode-dvc)'d like to have the option to query against the remote / cache (as well) for files/dirs that are not on disk, see this comment for a better explanation and iterative/vscode-dvc#176 for the why.

Can we track this request here or would you like a separate ticket?

@dberenbaum
Copy link
Collaborator

@mattseddon I'm not sure I follow what you're trying to do. You want to show files in the vscode file explorer view even though those files don't exist in the local workspace?

Do you actually want to list the contents of the remote, or is that a proxy for listing dvc-tracked files that are not in the workspace? If it's the latter, you should not need to access the remote. It might be better to keep this in a separate issue, but it does seem like the fix suggested by @skshetry might be a prerequisite to list every "expected" dvc-tracked file, even if that file is missing in the local workspace. Thoughts @skshetry?

@mattseddon
Copy link
Member

mattseddon commented Apr 21, 2021

@dberenbaum sorry for the confusion,

Do you actually want to list the contents of the remote, or is that a proxy for listing dvc-tracked files that are not in the workspace? If it's the latter, you should not need to access the remote.

To begin with the latter but we may need the former in the future.

After working with the command for a while I realised that @skshetry had already detailed most of what I was seeing in his earlier post. This comment details my exact findings.

It would be good to have the option to choose between showing all tracked in a specific folder and only those that are on disk. I.e dvc list . --dvc-only (shows all entries by default) vs dvc list . --dvc-only --local-only.

Let me know what you think.

Note: It seems that moving the -R before the . kind of gives me what I need but I will still have to take the immediate contents of the folder and dedup.

@skshetry
Copy link
Member Author

@dberenbaum, --dvc-only is quite an ambiguous option, I am not quite sure what it should reflect, the committed version or the to-be committed one or mix of both.

To begin with the latter but we may need the former in the future.

Regarding the use in the vscode-dvc, maybe it should use dvc list to get the current state and then use dvc status to fill in the gaps?

We might need to add more granular support for that in status though. dvc list serves a different purpose, so trying to get the state of the repo might not be a good fit there.

@dberenbaum
Copy link
Collaborator

@skshetry I'm still confused about whether any of the issues you noted are about --dvc-only or more generally about dvc list. Do we have any background on the rationale behind --dvc-only? Is there some known scenario or internal use for it?

Regarding the use in the vscode-dvc, maybe it should use dvc list to get the current state and then use dvc status to fill in the gaps?

We might need to add more granular support for that in status though. dvc list serves a different purpose, so trying to get the state of the repo might not be a good fit there.

Using status might make sense since it seems that dvc list is intended to be like ls with some additional dvc-specific fields for the files listed. However, having to use two commands to get a full description of what dvc is tracking in the repo seems cumbersome, so we probably need one command that can do it.

@skshetry
Copy link
Member Author

skshetry commented Apr 22, 2021

I'm still confused about whether any of the issues you noted are about --dvc-only or more generally about dvc list

@dberenbaum, yeah, I didn't realize that difference.

Do we have any background on the rationale behind --dvc-only?

This was never a problem before 2.0, as we always used to clone the repo and list from them, so there were no modified states. Initially, it was implemented as --outs-only, and I could find the following information in #2509 (comment).

On the other hand it can be good to show just a list of all DVC outputs. It can be done with dvc ls --recursive --outputs-only for example.


so we probably need one command that can do it.

We could extend the dvc status to list all files with their status respectively with status --granular or status --long-form or similar flags.

@dberenbaum
Copy link
Collaborator

@skshetry thanks, #2509 helps a lot! It likens dvc list to ls, but the use case is about seeing the contents of a Git repo, which isn't really like ls to me since all of the DVC-tracked files are missing from the Git repo. For that use case, it seems like dvc list should ignore the workspace and show tree contents based on what's committed in Git, using HEAD by default. ls already can show what's in the workspace, and dvc status can show what's been deleted, modified, and added, so it makes sense for dvc list to only show what's committed.

So in the examples you gave:

data/foo (tracked and unchanged)

Should be listed

data/bar (tracked but now deleted)

Should be listed

data/foobar (recently added, not tracked, i.e. was not when dvc add was done previously)

Should not be listed

Let's document this behavior better once we agree on it.


We could extend the dvc status to list all files with their status respectively with status --granular or status --long-form or similar flags.

This sounds nice to me, although it is inconsistent with git status and it's unclear how much need there is for it, so I don't think we need to rush to prioritize it. I wonder how helpful it would be for @mattseddon?


$ dvc list -R . --dvc-only
data/bar
data/foo
data/foobar
$ dvc list . data -R --dvc-only
data/foo
data/foobar

This behavior (which looks like the same that @mattseddon mentioned in iterative/vscode-dvc#176 (comment)) looks like a bug. Can we open another issue for it?

@shcheklein
Copy link
Member

A few tickets and discussions that are related to this:

#3590 (read comments as well)
#4875

It likens dvc list to ls, but the use case is about seeing the contents of a Git repo, which isn't really like ls to me since all of the DVC-tracked files are missing from the Git repo.

there was a debate about this :) I think the idea was to show the whole tree, how would ls look like after you do git clone + git checkout rev + dvc pull. Mixing DVC cached files and Git tracked made sense because dvc get or dvc import can take any file from a Git repo, not necessarily DVC tracked ones.

should ignore the workspace and show tree contents based on what's committed in Git, using HEAD by default

I guess, my take on this was here #3590 (comment) . Not a strong opinion. Just felt natural (what would most people expect).

@dberenbaum
Copy link
Collaborator

Thanks @shcheklein! Feels like we are going in circles on what dvc list . should do 🤣 , which is probably a bad signal about the UI.

I agree that having ls-like behavior for dvc list . seems more natural. Why I suggested the other approach:

  • It seems like this is a special case that's inconsistent with every other usage of dvc list. What should dvc list . --rev HEAD do then? What should dvc list . show if the .dvc file remains but the file it tracks is deleted?
  • There doesn't seem to be another command to do something like git checkout && dvc checkout && ls (unless dvc list . --rev HEAD would do so).
  • Better documenting what dvc list does could make this behavior seem more natural.

@shcheklein
Copy link
Member

In that paradigm (again, not a strong opinion):

What should dvc list . --rev HEAD do then?

it will indeed show the latest commit sate (should disregard any changes in .dvc and dvc.lock that are not committed) .. I think.

What should dvc list . show if the .dvc file remains but the file it tracks is deleted?

still show files. The whole intention to always show exact state as if you would run dvc pull. E.g. to pass it to dvc get or dvc import after that.

There doesn't seem to be another command to do something like

hmm, not sure I got the point here 🤔

Better documenting what dvc list does could make this behavior seem more natural.

💯


On a separate topic - #4875 - is it the same bug cc @mattseddon ?

@dberenbaum
Copy link
Collaborator

I also don't have a strong opinion, but I find the command confusing and think we need it to be less confusing 😄 , so I appreciate the feedback to clarify.

What should dvc list . show if the .dvc file remains but the file it tracks is deleted?

still show files. The whole intention to always show exact state as if you would run dvc pull. E.g. to pass it to dvc get or dvc import after that.

From your comment in #3590 (comment):

It means if I run dvc list . I just see the files in the current location

These seem different to me, so maybe I misinterpreted. I thought the comment from #3590 meant that dvc list . would behave just like ls in this situation and not show a deleted file. I also thought that you meant for dvc list . to be a special case that doesn't act like dvc pull.

@shcheklein
Copy link
Member

It means if I run dvc list . I just see the files in the current location

I hope I meant that we take .dvc and dvc.lock from the workspace and also Git-tracked files "as is". And create a mix of the ls . + all outputs extracted from the current .dvc and dvc.lock. Kinda ls . + virtual DVC-tracked files (or real if some of them exist in the wokspace)

I also thought that you meant for dvc list . to be a special case that doesn't act like dvc pull.

To be honest I don't remember all the context, but it seems to be that the major thing in that ticket was about clone (and we don't rely on a commit/revision) when we do dvc list . (w/o --rev). We just literally read existing state of the workspace. So, not cloning in this sense means not only an optimization, but also behavior change.

@dberenbaum
Copy link
Collaborator

So in the scenario @skshetry proposed:

data/foo (tracked and unchanged)
data/bar (tracked but now deleted)
data/foobar (recently added, not tracked, i.e. was not when dvc add was done previously)

Would you say that dvc list . data should show foo and bar since that would match what's in data.dvc?

If someone ran dvc commit, then dvc list . data should show foo and foobar?

@shcheklein
Copy link
Member

Good question :) 🤔🤔🤔🤔

If we go into showing workspace as-is (for both Git and DVC files), then it makes sense to show data as-is as well. In this case show foo and foobar (not bar).

or indeed go into dvc list . --rev=HEAD mode by default - also simple to explain, more magic to first time users I think.

@mattseddon
Copy link
Member

mattseddon commented Apr 23, 2021

We could extend the dvc status to list all files with their status respectively with status --granular or status --long-form or similar flags.

This sounds nice to me, although it is inconsistent with git status and it's unclear how much need there is for it, so I don't think we need to rush to prioritize it. I wonder how helpful it would be for @mattseddon?

The option to get a more granular status would be great, it would eliminate a lot of work needed on the vscode side and the granularity that we could then display would be useful to users.

This behavior (which looks like the same that @mattseddon mentioned in iterative/vscode-dvc#176 (comment)) looks like a bug. Can we open another issue for it?

&

On a separate topic - #4875 - is it the same bug cc @mattseddon ?

Looks different. Raised #5866

or indeed go into dvc list . --rev=HEAD mode by default - also simple to explain, more magic to first time users I think.

The behaviour of dvc list . --rev HEAD is probably what I am after in this specific case but it is currently broken.

Now feels like I was / am trying to hack the behaviour of list . to give me the current state of the repository at the last commit. The use case that I wanted to use list for specifically was 'displaying all files that are currently not on disk but available to the user'. Meaning that the user could browse through these files and potentially only pull 1 or 2 instead of 100s or 1000s. LMK if we can do this a different way.

I am going to go back and try to get the use cases that we are trying to solve now that we know what we can do (iterative/vscode-dvc#318). I think that attacking them problem from that angle might get us to a quicker resolution (for what vscode is after at least).

Thanks everyone, sorry for hijacking the issue.

P.s we do still need this optimized 😄 😬

@dberenbaum
Copy link
Collaborator

Thanks @mattseddon! That all makes sense to me, except for one thing I want to clarify:

Now feels like I was / am trying to hack the behaviour of list . to give me the current state of the repository at the last commit. The use case that I wanted to use list for specifically was 'displaying all files that are currently not on disk but available to the user'. Meaning that the user could browse through these files and potentially only pull 1 or 2 instead of 100s or 1000s. LMK if we can do this a different way.

So you want dvc status to have an option to show which specific files are missing from directories? Otherwise, it seems like dvc status already shows all files not on disk as deleted unless I'm missing something.

@dberenbaum
Copy link
Collaborator

If we go into showing workspace as-is (for both Git and DVC files), then it makes sense to show data as-is as well. In this case show foo and foobar (not bar).

or indeed go into dvc list . --rev=HEAD mode by default - also simple to explain, more magic to first time users I think.

You had me convinced and then you changed your mind 😄 !

Showing everything on disk as-is makes this completely redundant to ls, right? If it instead shows what is expected based on the dvc files in the workspace, this seems like it at least adds value and is more consistent with dvc list someurl.

What about files that are not tracked by git or dvc but are on disk in the local workspace?

@shcheklein
Copy link
Member

Showing everything on disk as-is makes this completely redundant to ls, right?

not exactly, since in all the options we discuss (at least I had in mind) we always show data/data.xml for data/data.dvc even if it's not present.

If it instead shows what is expected based on the dvc files in the workspace, this seems like it at least adds value and is more consistent with dvc list someurl.

yep, it makes sense to me. Kinda "after dvc pull" semantics?

What about files that are not tracked by git or dvc but are on disk in the local workspace?

I would probably still show them, I guess.

WDYT?

@dberenbaum
Copy link
Collaborator

it makes sense to show data as-is as well. In this case show foo and foobar (not bar).

in all the options we discuss (at least I had in mind) we always show data/data.xml for data/data.dvc even if it's not present.

This is where I'm still confused since these seem contradictory to me, unless you are suggesting that directories are a special case.

What about files that are not tracked by git or dvc but are on disk in the local workspace?

I would probably still show them, I guess.

WDYT?

To be honest, I was thinking we shouldn't show them 🤣 . I'm leaning towards keeping behavior consistent with dvc list remoteurl (except for not cloning).

@shcheklein
Copy link
Member

This is where I'm still confused since these seem contradictory to me, unless you are suggesting that directories are a special case.

No, I agree (note: a few previous example above I was wrong/had different semantics, I'm also adjusting my thinking as we go :) ) that we should treat directories and files the same way. We read the corresponding .dvc or dvc.lock entry and show the corresponding directory content or data file to those hashes. But we read them (.dvc or dvc.lock) "as-is" from the workspace, even if the are "dirty".

To be honest, I was thinking we shouldn't show them 🤣 . I'm leaning towards keeping behavior consistent with dvc list remoteurl (except for not cloning).

So, there are three cases.

  • A file or a directory that is completely new to DVC and to Git. I would show it since we have semantics to read files from the workspace, not commit.
  • A new .dvc files that is not committed into Git yet. I would show it + a corresponding directory/file (even if it is not present now).
  • Changed DVC-tracked directory/file (removed, updated, a few files added to a directory, a few files remove from a directory). Still, feels we can rely on the hash entry in the corresponding .dvc/dvc.lock and show the file/directory as defined there.

So, to summarize in a different way:

  • we rely on a dirty workspace from Git perspective
  • we rely on dvc commit-ed state from DVC perspective

It kinda dvc list remoteurl except we don't do HEAD, we do workspace. Unless I'm missing something.

@skshetry
Copy link
Member Author

skshetry commented May 26, 2021

It kinda dvc list remoteurl except we don't do HEAD, we do workspace.

So, what you are saying is to show both workspace + dvc commit, right?
Now, I think we should also clarify what --dvc-only means? Does that mean dvc commit or should it also include new updates within the output?

@dberenbaum
Copy link
Collaborator

@skshetry Do you have an example to clarify?

I'm trying to remember the whole conversation above because it's been awhile.

Let's say we start with a clean dvc repo. Here's what I would expect:

$ mkdir data
$ echo foo > data/foo
$ echo bar > data/bar
$ dvc list -R . # add files not yet tracked by dvc or git - show additions but exclude from --dvc-only
data/foo
data/foobar
$ dvc list -R . --dvc-only
$ dvc add data
$ dvc list -R . # added data to dvc but not yet to git - show data
data/bar
data/foo
$ dvc list -R . --dvc-only
data/bar
data/foo
$ git add .
$ git commit -m "add data"
$ echo foobar > data/foobar
$ rm data/bar
$ dvc list -R . # edited dvc-tracked data but haven't committed to dvc or git - ignore edits
data/bar
data/foo
$ dvc list -R . --dvc-only
data/bar
data/foo
$ dvc commit
$ dvc list -R . # edited dvc-tracked data and committed to dvc but not git - show edited data
data/foo
data/foobar
$ dvc list -R . --dvc-only
data/foo
data/foobar
$ rm data.dvc # deleted .dvc file - show data in workspace but exclude from dvc-only
$ dvc list -R .
data/foo
data/foobar
$ dvc list -R . --dvc-only

Sorry for the lengthy example, but hopefully this will help avoid ambiguity. Does that all seem consistent with your understanding @skshetry @shcheklein? Any questions?

@dberenbaum
Copy link
Collaborator

Need to reconsider this so that it's consistent with upcoming dvc status updates in https://www.notion.so/iterative/Consolidate-repo-status-ed3cd60f706f4fcaba1d3f3cac1498e9.

cc @efiop

@dberenbaum
Copy link
Collaborator

Okay, reviewing this again, it seems consistent with proposed dvc status changes.

Another way to summarize the desired behavior for dvc ls -R [--dvc-only] .: return the same as git add; git commit; dvc list -R [--dvc-only] --rev HEAD ..

@dberenbaum
Copy link
Collaborator

Do we expect to update this along with dvc status @skshetry @efiop? Should we keep it p1?

@daavoo daavoo added the A: status Related to the dvc diff/list/status label Feb 22, 2022
@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Apr 6, 2022
@dberenbaum
Copy link
Collaborator

@skshetry Do you think we should move the rest of the discussion about expected dvc list behavior to another issue? I don't believe that's all addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: status Related to the dvc diff/list/status discussion requires active participation to reach a conclusion optimize Optimizes DVC p2-medium Medium priority, should be done, but less important performance improvement over resource / time consuming tasks product: VSCode Integration with VSCode extension
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants