Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better support SCM functionality #772

Closed
mattseddon opened this issue Sep 3, 2021 · 10 comments
Closed

Better support SCM functionality #772

mattseddon opened this issue Sep 3, 2021 · 10 comments
Assignees
Labels
A: trees Area: SCM and DVC-tracked trees discussion enhancement New feature or request product PR that affects product

Comments

@mattseddon
Copy link
Member

mattseddon commented Sep 3, 2021

From #318 the basic user cases are as follows:

Proposal

Notion.

SCM view

  1. View all directories and files that have changed when compared to HEAD or cache (Visibility / Situational awareness)
  2. Checkout / commit and files or directories that have changed (Actions) - exact UX TBD.
  3. Checkout / commit / push or pull the entire repository (Actions)

Statuses that we currently provide in the extension

Status SCM View Decorations Provided ** Sourced from Notes
added Y Y diff + list
deleted Y Y diff + list
modified Y Y diff + list + status
notInCache Y Y diff + list
renamed Y Y diff + list
stageModified Y Y diff + list + status For a detailed explanation of modified vs stageModified see #318 (comment)
untracked Y Y git this is untracked with respect to both git and dvc. We show these files because the user may want to dvc add them.
tracked Y Y list we decorate tracked because they are generally "git ignored" which will give them a "greyed out" decoration

** Where possible we match the git extension's decorations because we are trying to make the extension feel as native as possible. Our SCM integration is designed to show the user the state of the workspace with respect to the most recent commit.

Current approach (parallel CLI Commands)

name command reason
list dvc list . --dvc-only -R --show-json provides a list of all tracked files that we use for both decoration and SCM purposes. In the SCM view all files that we show must be tracked by DVC. We do this because we end up with untracked but modified (duplicates) items in the tree from diff if we do not
diff dvc diff --show-json we map the output of diff directly to the list output to set all added, deleted, renamed, notInCache. We use it in combination with the status output to determine the difference between modified and "stage modified"
status dvc status --show-json only used to determine the difference between modified and "stage modified"

We currently try to run all three of the above commands in parallel. If any of the commands fail then we will retry all three until they have all completed without error. We do this to best mitigate stale information ending up in the extension.

Issues with the current approach

  1. It's still slow (General performance of trees #608)
  2. We are unsure of the what the actual UI / UX should be (Decide on and implement UX for checkout / commit workflow in SCM view #609)
  3. A lot of data is sent between the CLI and extension that is unused (example: in get-started-experiments after first running an experiment the output of diff contains ~80k "added" files, none of these files are tracked by dvc so we filter all of the records out)
  4. We have issues running multiple commands in parallel (After reloading the window experiments part is stuck in the "loading" state #767 (comment)) <- this is particularly important because it means we cannot currently run the extension against get-started-experiments

Options for mitigation

# option pros cons
1 Run commands sequentially locks should no longer be an issue even slower
2 Only rerun failed commands also mitigates lock issue involves more complicated logic, possibility of stale data
3 Make all 3 commands lockless allows us to continue to run all commands in parallel involves work from the CLI team and is only an interim solution, complicates internal of DVC
4 Combine commands into single command that the integration can run limits the amount of data needing to be transferred between the cli and extension, should be faster, cuts out grouped retry logic more effort required, unsure as to benefit to general users
5 Replace CLI calls with event driven architecture eliminates the need to call the CLI, could serve multiple clients requires even more work and is not a short or even medium term solution
6 Make commands "lightweight" (add --dvc-only) would limit the amount of data being passed and could speed things up unsure as to the benefit to general users, still requires effort, could still run into lock issues

My preference would be to start work on 4 as it would actually help us move towards 5.

@mattseddon mattseddon added enhancement New feature or request discussion large files A: trees Area: SCM and DVC-tracked trees product PR that affects product labels Sep 3, 2021
@mattseddon
Copy link
Member Author

cc @shcheklein @dberenbaum @efiop.

Chat about this next time we are all together.

@mattseddon
Copy link
Member Author

mattseddon commented Sep 13, 2021

Note: The native VS Code git extension gets all of the information that it needs from git status -z -u. It creates a child process to run the command in much the same way that we do.

If we are going to emulate what git does for status and diff then the correct command to update would be status. Here is what we can currently get from git:

image

IMO the current DVC implementation of status is doing something else and should be moved to be a subcommand e.g dvc stage status.

@mattseddon
Copy link
Member Author

mattseddon commented Sep 13, 2021

@mattseddon
Copy link
Member Author

Having discussed with @efiop and @dberenbaum in the 21/09/14 cross team meeting we came to the following conclusion for next steps:

  1. We need a single command.
  2. The best candidate is status (to match up with git).
  3. It should have an option to display paths only tracked by DVC (--dvc-only).
  4. It should have an option (or separate command) to show stage information (dvc stage status).
  5. It will have all the same available statuses as diff with a further status used to separate files between modified against cache and modified against HEAD.

Points to discuss:

  1. Should the current dvc status be migrated to a separate command (e.g dvc stage status) or should there be a flag added to condense the information to what we need?
  2. What will the new status be named?
  3. How will the new status be displayed in the output (to distinguish from modified)? The way that git handles this is shown above (Better support SCM functionality #772 (comment))

I will start a document now in notion here. We can continue the discussion there.

@dberenbaum
Copy link
Contributor

Thanks, @mattseddon! Looks great.

We actually have a proposal template in Notion. I didn't want to ask you to do that much work, but you basically filled the template already in the doc you created, so I transferred your text into a proposal in https://www.notion.so/iterative/Consolidate-repo-status-ed3cd60f706f4fcaba1d3f3cac1498e9.

I'd like to flesh it out with some DVC requirements since this work should also be about improving the user experience within DVC. Take a look and let me know if/when it's okay to start adding on to the proposal.

@mattseddon

This comment has been minimized.

@mattseddon
Copy link
Member Author

Issue when running experiments against example-dvc-experiments: microsoft/vscode#133425

cc @shcheklein (for when you see it)

@mattseddon mattseddon changed the title Better Support SCM functionality Better support SCM functionality Sep 22, 2021
@mattseddon
Copy link
Member Author

Demo on why we need two different modified statuses

Screen.Recording.2021-10-06.at.1.42.05.pm.mov

Red = "stage Modified" (changed against HEAD, DVC up to date).
Blue = "modified" (DVC not up to date).

Steps

  1. Run experiment (video start just after it ends). We can easily see from the SCM view that there are files in the workspace that are modified and git tracked. They are denoted with a blue M and shown in the top tree.
  2. Stage the git tracked files (equivalent of git add .). They are then shown under the Staged Chages resource group. At that point everything matches and we are ready to git commit.
  3. Discard all of the git changes from the workspace (equivalent of git reset --hard HEAD). The DVC files are then shown as simply "modified" (again a blue M), if we performed a git commit at this point nothing would happen.
  4. Run dvc commit. This brings us back to the "stage modified" status for all of the DVC tracked files. Again, we could potentially git commit from this point.

Without a way to differentiate between the two sets of changes (i.e changes between the HEAD and the workspace and what DVC expects and the workspace) this would become even less intuitive to the user.

Hopefully this helps to make more sense of why these two statuses are needed.

cc @dberenbaum @efiop.

Thanks,

@dberenbaum
Copy link
Contributor

@mattseddon There is now a detailed proposal and examples for dvc status in https://www.notion.so/iterative/Consolidate-repo-status-ed3cd60f706f4fcaba1d3f3cac1498e9. PTAL and provide feedback when you have a chance.

@mattseddon
Copy link
Member Author

Work here has been done. data:status is currently being implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: trees Area: SCM and DVC-tracked trees discussion enhancement New feature or request product PR that affects product
Projects
None yet
Development

No branches or pull requests

3 participants