ls: optimize --dvc-only #5712
I decided to wait for fsspec changes. As discussed with @efiop, I like the idea of using cc @isidentical |
It'd be great if we could clarify what Let's say I have a
When I do The following is the current behaviour:
$ dvc list -R . --dvc-only
data/bar
data/foo
data/foobar
$ dvc list . data -R --dvc-only
data/foo
data/foobar
cc @dberenbaum |
Is this a |
We are planning to use this functionality for the VS Code extension. We will need the list of tracked files for at least two reasons:
Ideally we would like to see all files / directories that are being tracked by DVC appear whenever running Do you have a rough timeframe for when the initial optimisation work will happen? Thanks, |
Good to know, @mattseddon! We didn't have this prioritized yet, but we can try to do so soon now that we know you need it. If you have any particular timeline in mind or relative priority with other vscode requests, let us know. |
Thank you @dberenbaum. Within the next 8 weeks is probably soon enough for us. Will let you know if anything changes. |
@dberenbaum got a further requirement for this one in that we( Can we track this request here or would you like a separate ticket? |
@mattseddon I'm not sure I follow what you're trying to do. You want to show files in the vscode file explorer view even though those files don't exist in the local workspace? Do you actually want to list the contents of the remote, or is that a proxy for listing dvc-tracked files that are not in the workspace? If it's the latter, you should not need to access the remote. It might be better to keep this in a separate issue, but it does seem like the fix suggested by @skshetry might be a prerequisite to list every "expected" dvc-tracked file, even if that file is missing in the local workspace. Thoughts @skshetry? |
@dberenbaum sorry for the confusion,
To begin with the latter but we may need the former in the future. After working with the command for a while I realised that @skshetry had already detailed most of what I was seeing in his earlier post. This comment details my exact findings. It would be good to have the option to choose between showing all tracked in a specific folder and only those that are on disk. I.e Let me know what you think. Note: It seems that moving the |
@dberenbaum,
Regarding the use in the vscode-dvc, maybe it should use We might need to add more granular support for that in |
@skshetry I'm still confused about whether any of the issues you noted are about
Using |
@dberenbaum, yeah, I didn't realize that difference.
This was never a problem before 2.0, as we always used to clone the repo and list from them, so there were no modified states. Initially, it was implemented as
We could extend the |
@skshetry thanks, #2509 helps a lot! It likens So in the examples you gave:
Should be listed
Should be listed
Should not be listed Let's document this behavior better once we agree on it.
This sounds nice to me, although it is inconsistent with
This behavior (which looks like the same that @mattseddon mentioned in iterative/vscode-dvc#176 (comment)) looks like a bug. Can we open another issue for it? |
A few tickets and discussions that are related to this: #3590 (read comments as well)
there was a debate about this :) I think the idea was to show the whole tree, how would
I guess, my take on this was here #3590 (comment) . Not a strong opinion. Just felt natural (what would most people expect). |
Thanks @shcheklein! Feels like we are going in circles on what I agree that having
|
In that paradigm (again, not a strong opinion):
it will indeed show the latest commit state (should disregard any changes in .dvc and dvc.lock that are not committed) .. I think.
still show files. The whole intention to always show exact state as if you would run
hmm, not sure I got the point here 🤔
💯 On a separate topic - #4875 - is it the same bug cc @mattseddon ? |
I also don't have a strong opinion, but I find the command confusing and think we need it to be less confusing 😄 , so I appreciate the feedback to clarify.
From your comment in #3590 (comment):
These seem different to me, so maybe I misinterpreted. I thought the comment from #3590 meant that |
I hope I meant that we take
To be honest I don't remember all the context, but it seems to me that the major thing in that ticket was about clone (and we don't rely on a commit/revision) when we do |
So in the scenario @skshetry proposed:
Would you say that If someone ran |
Good question :) 🤔🤔🤔🤔 If we go into showing workspace as-is (for both Git and DVC files), then it makes sense to show data as-is as well. In this case show or indeed go into |
The option to get a more granular status would be great, it would eliminate a lot of work needed on the vscode side and the granularity that we could then display would be useful to users.
&
Looks different. Raised #5866
The behaviour of Now feels like I was / am trying to hack the behaviour of I am going to go back and try to get the use cases that we are trying to solve now that we know what we can do (iterative/vscode-dvc#318). I think that attacking the problem from that angle might get us to a quicker resolution (for what vscode is after at least). Thanks everyone, sorry for hijacking the issue. P.S. we do still need this optimized 😄 😬 |
Thanks @mattseddon! That all makes sense to me, except for one thing I want to clarify:
So you want |
You had me convinced and then you changed your mind 😄 ! Showing everything on disk as-is makes this completely redundant to What about files that are not tracked by git or dvc but are on disk in the local workspace? |
not exactly, since in all the options we discuss (at least I had in mind) we always show
yep, it makes sense to me. Kinda "after
I would probably still show them, I guess. WDYT? |
This is where I'm still confused since these seem contradictory to me, unless you are suggesting that directories are a special case.
To be honest, I was thinking we shouldn't show them 🤣 . I'm leaning towards keeping behavior consistent with |
No, I agree (note: in a few previous examples above I was wrong/had different semantics; I'm also adjusting my thinking as we go :) ) that we should treat directories and files the same way. We read the corresponding
So, there are three cases.
So, to summarize in a different way:
It kinda |
So, what you are saying is to show both workspace + dvc commit, right? |
@skshetry Do you have an example to clarify? I'm trying to remember the whole conversation above because it's been a while. Let's say we start with a clean dvc repo. Here's what I would expect:
$ mkdir data
$ echo foo > data/foo
$ echo bar > data/bar
$ dvc list -R . # add files not yet tracked by dvc or git - show additions but exclude from --dvc-only
data/bar
data/foo
$ dvc list -R . --dvc-only
$ dvc add data
$ dvc list -R . # added data to dvc but not yet to git - show data
data/bar
data/foo
$ dvc list -R . --dvc-only
data/bar
data/foo
$ git add .
$ git commit -m "add data"
$ echo foobar > data/foobar
$ rm data/bar
$ dvc list -R . # edited dvc-tracked data but haven't committed to dvc or git - ignore edits
data/bar
data/foo
$ dvc list -R . --dvc-only
data/bar
data/foo
$ dvc commit
$ dvc list -R . # edited dvc-tracked data and committed to dvc but not git - show edited data
data/foo
data/foobar
$ dvc list -R . --dvc-only
data/foo
data/foobar
$ rm data.dvc # deleted .dvc file - show data in workspace but exclude from dvc-only
$ dvc list -R .
data/foo
data/foobar
$ dvc list -R . --dvc-only
Sorry for the lengthy example, but hopefully this will help avoid ambiguity. Does that all seem consistent with your understanding @skshetry @shcheklein? Any questions? |
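The expectations in the walkthrough above can be condensed into a small model. The Python below is only a sketch of those proposed semantics, not DVC's implementation: the function name `expected_listing`, its parameters, and the dict standing in for the contents of `data.dvc` are all hypothetical.

```python
def expected_listing(workspace, recorded, dvc_only=False):
    """Model the listing proposed in the walkthrough (illustrative only).

    workspace: set of file paths currently on disk.
    recorded:  dict mapping a tracked directory to the set of files recorded
               in its .dvc file (hypothetical stand-in for data.dvc).
    """
    # Files recorded in .dvc files are always listed, even if they were
    # edited or deleted in the workspace since the last add/commit.
    tracked = set()
    for files in recorded.values():
        tracked |= files
    if dvc_only:
        return sorted(tracked)
    # Workspace files outside any tracked directory are shown as-is.
    untracked = {
        path for path in workspace
        if not any(path.startswith(d + "/") for d in recorded)
    }
    return sorted(tracked | untracked)


# Edited workspace, edits not yet committed to DVC: edits are ignored.
edited = {"data/foo", "data/foobar"}
recorded = {"data": {"data/foo", "data/bar"}}
print(expected_listing(edited, recorded))
# .dvc file removed: workspace files are shown, but not with dvc_only.
print(expected_listing(edited, {}), expected_listing(edited, {}, dvc_only=True))
```

In this model the `.dvc` file is the single source of truth for tracked directories, which is why uncommitted edits only become visible after `dvc commit`.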
Need to reconsider this so that it's consistent with upcoming cc @efiop |
Okay, reviewing this again, it seems consistent with proposed Another way to summarize the desired behavior for |
@skshetry Do you think we should move the rest of the discussion about expected |
For dvc ls --dvc-only, we walk through all files, filter out those that are not DVC-tracked, and list the rest. This could be optimized by walking only the DVC-tracked files from the trie we have.
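To illustrate the difference between the two approaches, here is a minimal sketch using a toy trie built from nested dicts. It is not DVC's actual trie or listing code; the helper names `build_trie`, `list_by_filtering`, and `list_from_trie` are hypothetical, and the sketch assumes every tracked path is a file (a leaf in the trie).

```python
def build_trie(tracked_paths):
    """Build a nested-dict trie keyed by path component."""
    root = {}
    for path in tracked_paths:
        node = root
        for part in path.split("/"):
            node = node.setdefault(part, {})
    return root


def list_by_filtering(all_files, tracked_paths):
    """Approach described in the issue: visit every file, keep tracked ones."""
    tracked = set(tracked_paths)
    return sorted(f for f in all_files if f in tracked)


def list_from_trie(trie, prefix=""):
    """Proposed approach: walk only the trie of tracked paths."""
    results = []
    for name, child in sorted(trie.items()):
        path = f"{prefix}/{name}" if prefix else name
        if child:
            results.extend(list_from_trie(child, path))
        else:
            results.append(path)  # leaf node: a tracked file
    return results


all_files = ["README.md", "data/foo", "data/foobar", "src/train.py"]
tracked = ["data/foo", "data/foobar"]
print(list_by_filtering(all_files, tracked))
print(list_from_trie(build_trie(tracked)))
```

Both functions return the same paths here, but the trie walk's cost depends only on the number of tracked entries, not on the size of the whole repository.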