Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide on basic use cases for explorer tree and scm views #318

Closed
mattseddon opened this issue Apr 23, 2021 · 13 comments
Closed

Decide on basic use cases for explorer tree and scm views #318

mattseddon opened this issue Apr 23, 2021 · 13 comments
Assignees
Labels
A: extension Area: core extension 🎨 design Needs design input or is being actively worked on discussion product PR that affects product

Comments

@mattseddon
Copy link
Member

We now have a couple of working prototype views available to the extension with respect to status & LFS. The questions for each of the two views are now:

  1. What pain point(s) are we trying to solve?
  2. Are these common enough use cases to be included in the first release?
  3. What is missing from what we already have?

Here are more detailed descriptions of each of the views and my answers to the above:

A scm view (#169):

Screen.Recording.2021-04-13.at.2.51.30.pm.mov

We can currently add new, push and pull modified or pull deleted.

Answers:

  1. Manage files / directories that are untracked by both git and dvc or tracked by dvc.
  2. Yes.
  3. Enhanced support for changes made within tracked directories and ????

A custom explorer tree view (#176):

Screen.Recording.2021-04-23.at.1.01.54.pm.mov

For this view there are a couple of caveats. The current behaviour of list . does not support the main use case that we are trying to support by implementing it. Also, we could try and "stuff" this behaviour into the scm view in some form or another (but I don't think that is a great idea).

Answers:

  1. Display all files that are not on disk but exist in the remote & ???
  2. If we could survive with only showing tracked directories that are missing from disk then i'd say no but I am unsure as to how often this situation comes up for users. How often do they have their workspace in a state that is mismatched with the remote / files are missing? If it is a lot then we should solve.
  3. Support the basic use case if we want it for the first release.

@shcheklein @rogermparent @dberenbaum would love to get your input on this one, is there anyone else that we should rope in?

cc @yalozhkin

Thanks

@mattseddon mattseddon added discussion A: extension Area: core extension 🎨 design Needs design input or is being actively worked on product PR that affects product LFS labels Apr 23, 2021
@mattseddon mattseddon self-assigned this Apr 23, 2021
@dberenbaum
Copy link
Contributor

IMO having file-level info about directories and more granular ways to manage them would be really helpful. See iterative/dvc#4657 (comment) and the rest of that issue as an example of how it might be useful. However, I don't know that is should be a blocker for v1 of the vscode extension because it might be a large undertaking to do it right and make directories more flexible.

@shcheklein
Copy link
Member

Thanks Matt, good summary and videos help to get thoughts together. Overall, on the #2, since it looks it's a continuation of that story:

Primary things that we had in mind for the first release.

  • Visibility When you open a project right after you clone it you don't see data, models, etc. E.g. when I open this one https://github.com/iterative/example-get-started - how do I know where the model.pkl is located? What datasets do I have?

  • Actions (granularity, convenience compared to CLI): After I see the files (data, intermediate, etc) - I want to be able to pull/push/checkout, etc only some of those, not everything at once.

  • Situation awareness When I switch to a different branch I want to see artifacts that are out of sync (and run dvc checkout to clean it).

  • Domain-specific navigation would be very beneficial to see eventually different versions of a dataset (on a right click?). Do something like git checkout dataset.dvc <revison> && dvc checkout dataset.dvc.


Now, on these two trees.

My take from what I see. Both trees make sense. SCM should focus on the changes only (completely missing data directories or files is an edge case because it can mean either delete or not before dvc pull state).

Main tree can focus on what we have - a list of datasets, models, intermediate things, etc.

In both tree we don't want (for now) to synchronously do cloud check (dvc status -c). We can rely on the local state. Analyze .dvc + dvc.lock files.


Some specific comments to the PR description:

We can currently add new, push and pull modified or pull deleted.

👍 probably, we should add dvc commit as well. pull deleted == pull absent == pull modified, right?

but exist in the remote

not need to check an actual remote storage for now, we can rely on the local state


Now, what is missing from what we already have?

I would need to review and play with it a bit. And come back to you. But some ideas we can get from the list above ^^. Also the requirement we'll have is to make reasonably responsive.

@mattseddon
Copy link
Member Author

@shcheklein a couple of extra questions from me:

  1. Do we want the commit action in both trees? I would have thought only in the SCM view would make the most sense?
  2. For my own understanding: What is the difference between dvc pull and dvc checkout. Is the difference that checkout will only check the cache and not storage? Would you generally run checkout instead of pull if you believe your cache is up to date?
  3. How should we represent statuses that are listed as 'always changed'? They are currently completely excluded. This can be an issue because we have folders like logs which are always marked as 'always changed' so if they get deleted then we can end up in a weird state.

@shcheklein
Copy link
Member

Do we want the commit action in both trees? I would have thought only in the SCM view would make the most sense?

sounds reasonable to me, our case though that we have a bit different semantics. Question though - does Git extension had actions in the projects tree? Anyways, we can start with SCM only and iterate.

Would you generally run checkout instead of pull if you believe your cache is up to date?

yes. dvc pull == dvc fetch + dvc checkout. dvc checkout all has some options like --relink to do some specific cache<-->workspace operations/adjustments.

dvc checkout runs faster since it doesn't have to do sync cache<--->remote. Even thing like dvc status -c (dvc pull includes it naturally) can be quite expensive. dvc checkout avoids all of this. I think it will complain if some files are missing in your cache. And it's up to you to run pull or not (maybe it's enough to do a partial checkout).

How should we represent statuses that are listed as 'always changed'?

let's ignore this. This is mostly about the pipelines functionality (when .dvc contains a command to run, and some other stage that depend on it, etc). It's not so much about data management.

@mattseddon
Copy link
Member Author

Consolidation of current functionality into text along with product questions

Tracked Tree View

Current functionality:

  • navigate through tree
  • open files - if file is a binary then a warning will be displayed

[Q] Do we want to add the ability to remove targets using this view?

Demo:

Screen.Recording.2021-05-06.at.3.40.38.pm.mov

SCM View

Current Functionality:

  • checkout repository
  • commit repository
  • add/push/pull/commit/checkout targets

Refresher on statuses:

  • new: An output is found in the workspace, but there is no corresponding file hash saved in the dvc.lock or .dvc file yet.
  • modified: An output or dependency is found in the workspace, but the corresponding file hash in the dvc.lock or .dvc file is not up to date.
  • deleted: The output or dependency is referenced in a dvc.lock or .dvc file, but does not exist in the workspace.
  • not in cache: An output exists in the workspace, and the corresponding file hash in the dvc.lock or .dvc file is up to date, but there is no corresponding cache file or directory.
  • untracked - neither tracked by git nor DVC.

List of commands with enabled statuses:

  • add - untracked
  • checkout - deleted, modified
  • commit - modified, new, not in cache
  • push - deleted, modified, new,
  • pull - deleted, modified, new, not in cache

[Q] Should we be able to

  1. commit not in cache
  2. push deleted
  3. pull or commit new (based on above definition)

(Quick) Demo:

Screen.Recording.2021-04-29.at.10.40.30.am.mov

File Decorations

File / directory decorations are provided to both of the above views and the built in explorer. I have tried to match our custom decoration to that of the built in git extension. Here is the current mapping:

status badge color tooltip
deleted D gitDecoration.deletedResourceForeground DVC deleted
modified M gitDecoration.modifiedResourceForeground DVC modified
new A gitDecoration.addedResourceForeground DVC added
not in cache NC gitDecoration.renamedResourceForeground DVC not in cache
tracked DVC tracked

Note: untracked is taken care of by the git extension's decorations. We need to override the files that are DVC tracked because they receive a "greyed out" decoration on account of being git ignored.

[Q] What is missing from the above? Could we open this up to a wider audience?

@shcheklein
Copy link
Member

@mattseddon

Thanks for the comprehensive summary, it helped a lot!

Tracked Tree View

[Q] Do we want to add the ability to remove targets using this view?

I saw that you already started doing that. If it's not too late I would postpone it for now. I still (even after going a few times, and trying it on multiple projects) don't understand the best focus for the Tracked Tree View.

A few question on my end (I'll check some of those things, but you might have some ideas already):

  • what happens with the regular project tree if directory contains many files?
  • can we add the right click context menu to the items? can we hook into the menu for the regular project tree?

SCM View

[Q] Should we be able to: commit not in cache

Yes, it'll essentially add them into cache.

[Q] Should we be able to: push deleted

Yes, let's do this for now. I think it's allowed in DVC, so, let's allow it here also.

[Q] Should we be able to: pull or commit new (based on above definition)

could you clarify please, how do we derive this status now?

I see this An output is found in the workspace, but there is no corresponding file hash saved in the dvc.lock or .dvc file yet., but I'm not sure we can call this new/added.

[Q] What is missing from the above?

Some thoughts:

  • push for target doesn't make much sense (since we don't focus the SCM tree view now on the difference between remote and local). It won't be visible when there are not discrepancies in the workspace/cache.
  • considering that, should we make push/pull part of the tracked tree view where we always show the full list of objects?

An important point that I'm still wrapping my mind around (since it goes beyond from what DVC is providing right now, all DVC commands like dvc status, etc haven't been touched for a while I would never let them drive our decisions here): when data.dvc is aligned with data (no changes from DVC perspective), but data.dvc is modified from Git perspective (new or changed). In this case we show data.dvc as modified, but data not. While it should be to my mind. It is a dataset modification from the previous commit. That's what is important for end users to see.

So, essentially, for every DVC tracked artifact there are two points of change - .dvc-file (or corresponding entry in the dvc.lock) and an artifact/dataset itself. DVC commands (dvc status, dvc list?) focus mostly on the discrepancy between .dvc-file and data. While they don't analyze changes to the .dvc-file from the previous commit.

It explains that struggle with us unable to show A properly. data.dvc-file is sync with data after dvc add data - thus we don't show it DVC SCM tree. We don't show A in the DVC tracked tree. Essentially, users will see only some changes to metadata (.dvc and/or dvc.lock) which is probably not what we want.

If we start taking into account changes to .dvc-files (and dvc.lock entries) from the git status (or git diff) we can now do for example:

  • A - .dvc-file is new (or new entry in dvc.lock)
  • M - either data.dvc modified or data is modified (same for dvc.lock entries and corresponding files)
  • D - actually deleted entry (both .dvc-file and data it point to)

(we can detect also some DVC-specific cases - like Not pulled, it's an edge case of the M when data is not present)

Those are at least in the DVC Tracked View.

How about SCM tree in this case? How should it complement the existing one? What should it focus on?

Probably, mostly by helping to git commit things (.dvc-files or parts of the dvc.lock). All other commands that effectively solve the ".dvc-file <--> data" discrepancy should go into DVC Tracked Tree.

File Decorations

Note: untracked is taken care of by the git extension's decorations. We need to override the files that are DVC tracked because they receive a "greyed out" decoration on account of being git ignored.

💯, create a checkbox in the review?


[Q] What is missing from the above? Could we open this up to a wider audience?

Yes, I'll start the process as we discussed. I'm worried a bit about experiments table - it's getting there, but can not be considered MVP yet. Anyways, I think it's good to start looking for the core users who will be testing and building this with us.

@mattseddon
Copy link
Member Author

Tracked Tree View

what happens with the regular project tree if directory contains many files?

All of the files get rendered when you open that particular folder, there should be no limit placed on the number of files available to a folder, it just might the case that VS Code nearly dies (or actually does) when you open that folder.

can we add the right click context menu to the items? can we hook into the menu for the regular project tree?

AFAIK this is not possible. Did some further research this morning and it looks unlikely. We will have to recreate / build all of the functions that we want to show in this tree.

SCM View

[Q] Should we be able to: pull or commit new (based on above definition)

could you clarify please, how do we derive this status now?
I see this An output is found in the workspace, but there is no corresponding file hash saved in the dvc.lock or .dvc file yet., but I'm not sure we can call this new/added.

Currently I just take what dvc status provides. If it says that the file is "new" then we'll give it that status. I think I understand now that this isn't really what we want.

  • push for target doesn't make much sense (since we don't focus the SCM tree view now on the difference between remote and local). It won't be visible when there are not discrepancies in the workspace/cache.

In this instance I copied what the git extension does. All of this is currently available in the context menu:

image

and when there is more than one repository available in the workspace you get this view:

image

You can sync inline which then provides this popup:

image

I do think there is a place to push / pull to / from the remote in this view, perhaps only at the top level? That would more closely match what git provides.

  • considering that, should we make push/pull part of the tracked tree view where we always show the full list of objects?

We can then move the inline commands to a right click menu inside the tree view (as discussed on call).

If we start taking into account changes to .dvc-files (and dvc.lock entries) from the git status (or git diff) we can now do for example:

  • A - .dvc-file is new (or new entry in dvc.lock)
  • M - either data.dvc modified or data is modified (same for dvc.lock entries and corresponding files)
  • D - actually deleted entry (both .dvc-file and data it point to)

(we can detect also some DVC-specific cases - like Not pulled, it's an edge case of the M when data is not present)

Those are at least in the DVC Tracked View.

How about SCM tree in this case? How should it complement the existing one? What should it focus on?

Probably, mostly by helping to git commit things (.dvc-files or parts of the dvc.lock). All other commands that effectively solve the ".dvc-file <--> data" discrepancy should go into DVC Tracked Tree.

Please correct me if I have oversimplified this. The way that I have been thinking about this is that the we want to help users navigate between the previous commit and the next one. I.e. if results are bad then I want to go back, if they are good then I want to commit and move forwards. We are lucky in that we get all of the git changes displayed for "free" in the SCM view right next to the DVC ones.

To me there are two sets of changes available in the workspace, the changes tracked by git and then on top of those we have the changes tracked by DVC. We have this equation:

prev git commit + changes tracked by git + changes tracked by dvc = current workspace

Here is an example of what is displayed when I arbitrarily change some of the information within the demo project's data/MNIST/raw.dvc data placeholder (the md5 hash is "not in cache" after a random change):

image

In order to get back to the previous git commit we would need to run git reset --hard HEAD (- git changes)
& dvc checkout (- dvc changes [to be sure we have squashed any changes in the corresponding directory]).

In order to get us to the next commit we would need to dvc commit & git commit -a.

Although in the case of running experiments it appears that everything is aligned when the experiment finishes and the user would just need to git commit???:

image

In the situation of new data being added I think that displaying the added (A) decoration for files / directories that have been added to DVC but not checked in to git would be useful to users. It would let them know that action still needs to be taken / encourage them to commit the "data placeholder(s)" into git.

If the above shows the correct order of operation then we should add repo level commands to short circuit the current workflow (IMO).

File Decoration

💯, create a checkbox in the review?

This is already done (so we can cross one thing off the list 👍🏻 ).

@mattseddon
Copy link
Member Author

Based on further discussion with @shcheklein this morning:

We want to overlay any git differences between the workspace and the previous commit onto the equivalent dvc tracked files. E.g if we have a git tracked data.dvc file and a dvc tracked data/ folder with items inside we will show the following:

data.dvc (git) status data/ (dvc) status visible status
M - M
- M M
M M M

Where - signifies unchanged from previous commit.

The next step is now to investigate dvc diff.

☝🏻 this looks pretty good. Here is some example output after dvc exp run + rm data/MNIST/raw/t10k-images-idx3-ubyte:

{
  added: [],
  deleted: [
    {
      path: 'data/MNIST/raw/t10k-images-idx3-ubyte'
    }
  ],
  modified: [
    {
      path: 'data/MNIST/raw/'
    },
    {
      path: 'logs/'
    },
    {
      path: 'logs/acc.tsv'
    },
    {
      path: 'logs/loss.tsv'
    },
    {
      path: 'model.pt'
    },
    {
      path: 'predictions.json'
    }
  ],
  renamed: [],
  'not in cache': []
}

I'll plug diff into our file decoration and see what happens next.

We (probably) want to parse out files like 'data/MNIST/raw/t10k-images-idx3-ubyte' from the SCM view as stated previously in a follow up point (#330 (comment)). Will keep that in mind as I spike this.

@dberenbaum
Copy link
Contributor

Nice summary! That makes sense to me, although I want to clarify one thing with the third row:

data.dvc (git) status data/ (dvc) status visible status
M M M

I think this would happen if changes were made and dvc committed, and then further changes to data/ were made and not dvc committed. If those further changes are dvc committed, no change in status will be seen (see the first row).

Is there an icon to commit the changes in the DVC row? What is this going to do in the third row example? What will an icon to revert the changes do (if it will exist)?

It seems like dvc commits are kind of analogous to staging changes in Git. Modifications may be staged or unstaged, but only staged modifications will be reflected in the Git commit. Staged modifications will show as unmodified after the Git commit, unstaged ones will still show as modified. Would it make sense to borrow from how Git staging is handled?

@mattseddon
Copy link
Member Author

Is there an icon to commit the changes in the DVC row? What is this going to do in the third row example? What will an icon to revert the changes do (if it will exist)?

There are both checkout and commit inline icons available at the moment. They will run those commands directly against the target.

When I dropped diff in place of status I started to understand this point better.

My understand now is that:

  • If we use only diff then we are always comparing directly to HEAD which means that we have no idea of the state of the files compared to the git tracked files available in the workspace.
  • If we only use status then we will miss files that are indeed different to the previous commit.

It seems like dvc commits are kind of analogous to staging changes in Git. Modifications may be staged or unstaged, but only staged modifications will be reflected in the Git commit. Staged modifications will show as unmodified after the Git commit, unstaged ones will still show as modified. Would it make sense to borrow from how Git staging is handled?

This really helped and I have made a prototype following this criteria:

data.dvc (git) status data/ (dvc) status visible status color (this is how the git extension differentiates)
M - M gitDecoration.stageModifiedResourceForeground
- M M gitDecoration.modifiedResourceForeground
M M M gitDecoration.modifiedResourceForeground

I.e if your dvc workspace matches what is going to be committed into git then we provide a stageModified decoration. Otherwise we just show the usual modified.

Here are some demos of these two different statuses in action ("unstaged" = blue, "staged" = yellow):

exp run -> at the end of the experiment all changes differ from the last commit but match dvc ->

Screen.Recording.2021-05-21.at.6.40.53.pm.mov

we can commit into git and roll forwards (not action needed in dvc scm):

Screen.Recording.2021-05-21.at.6.48.36.pm.mov

or discard and roll back (run dvc checkout after git reset):

Screen.Recording.2021-05-21.at.6.50.29.pm.mov

The exact UX needs TBD but I think we should look to short circuit the two steps wherever possible.

LMK what you think, happy to jump on a call and discuss next week.

Note: predictions.json is clearly misbehaving, even when it was checked into git it was showing up in dvc diff. I've asked about it in the dvc-eng channel:
https://iterativeai.slack.com/archives/CNQ95CG1K/p1621561571046800

@mattseddon
Copy link
Member Author

@shcheklein @dberenbaum

I think we have arrived at our basic use cases. We have:

Decorations

  1. See changes relative to HEAD and / or cache (Situational Awareness).

Explorer tree view

  1. View all directories and files available to the project (Visibility / Situational awareness)
  2. Push, pull, remove and delete those directories and files (Actions)

SCM view

  1. View all directories and files that have changed when compared to HEAD or cache (Visibility / Situational awareness)
  2. Checkout / commit and files or directories that have changed (Actions) - exact UX TBD.
  3. Checkout / commit / push or pull the entire repository (Actions)

If we are happy with these basic use cases then I will close this issue and move all remaining items into #330.

LMK what you think.

Thanks

@shcheklein
Copy link
Member

@mattseddon great summary, and sounds right to me. Thanks.

@mattseddon
Copy link
Member Author

Closing, will track and resolve all issues under #330. 🎉 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: extension Area: core extension 🎨 design Needs design input or is being actively worked on discussion product PR that affects product
Projects
None yet
Development

No branches or pull requests

3 participants