ref: data status updates #3924

Merged
merged 16 commits into from
Oct 18, 2022
Changes from 4 commits
78 changes: 29 additions & 49 deletions content/docs/command-reference/data/status.md
@@ -1,6 +1,6 @@
# data status

Show changes in the data tracked by DVC in the workspace.
Show changes to the files and directories tracked by DVC.

Member

The command name is data. I am not sure what we get by replacing it.

Contributor Author

We are specific about what we mean by data. Docs should explain 🙂


## Synopsis

@@ -43,32 +43,37 @@ DVC uncommitted changes:
(there are other changes not tracked by dvc, use "git status" to see)
```

As shown above, the `dvc data status` displays changes in multiple categories:

- _Not in cache_ indicates that the hash for files are recorded in `dvc.lock`
and `.dvc` files but the corresponding cache files are missing.
- _DVC committed changes_ indicates that there are changes that are
`dvc-commit`-ed that differs with the last Git commit. There might be more
detailed state on how each of those files changed: _added_, _modified_,
_deleted_ and _unknown_.
Comment on lines -60 to -62
Contributor Author

One question though: what's an unknown change? Does it apply to untracked files too? I didn't know where to mention that in the updated text.

Collaborator

For example, let's say you clone a repo with a tracked directory but haven't done dvc fetch/pull. You have an md5 in the .dvc file but the directory is empty. If you modify the directory (for example, add a file to it), DVC can tell that the directory doesn't match the md5 but has no info about the granular changes to the files in that directory.

Contributor Author (jorgeorpinel, Sep 13, 2022)

> DVC can tell that the directory doesn't match the md5 but has no info about the granular changes to the files in that directory

So it only applies to --granular right? (Assuming that generalizing that example is the one scenario for this.) I added a note in that option instead of trying to cover it up here (it still counts as "new, modified, or deleted"). aca9d8a

Collaborator

@skshetry Do you know if it can ever apply to non-granular data?

Member

yeah, it only applies to granular.
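
The scenario discussed in this thread can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not DVC's implementation: the function names (`file_md5`, `dir_hash`, `granular_status`) and the way a directory hash is derived from a sorted manifest are hypothetical. It only shows why a directory-level hash mismatch can be detected while the per-file state stays `unknown` when the per-file record is not available locally.

```python
import hashlib
from typing import Dict, Optional


def file_md5(data: bytes) -> str:
    """Hash the contents of a single file."""
    return hashlib.md5(data).hexdigest()


def dir_hash(files: Dict[str, bytes]) -> str:
    """Hash a whole directory as a sorted manifest of (path, file hash) pairs.

    Assumption: this manifest format is illustrative only.
    """
    manifest = sorted((path, file_md5(data)) for path, data in files.items())
    return hashlib.md5(repr(manifest).encode()).hexdigest()


def granular_status(
    recorded_hash: str,
    cached_manifest: Optional[Dict[str, str]],
    workspace: Dict[str, bytes],
) -> Dict[str, str]:
    """Return a per-file state for a directory tracked as a whole.

    recorded_hash   -- directory hash stored in the metadata file
    cached_manifest -- path -> file hash, or None if it was never fetched
    workspace       -- path -> current file contents
    """
    if dir_hash(workspace) == recorded_hash:
        return {}  # directory matches the record: nothing changed
    if cached_manifest is None:
        # The directory differs from the record, but without the per-file
        # manifest there is nothing to compare individual files against.
        return {path: "unknown" for path in workspace}
    status: Dict[str, str] = {}
    for path, data in workspace.items():
        if path not in cached_manifest:
            status[path] = "added"
        elif cached_manifest[path] != file_md5(data):
            status[path] = "modified"
    for path in cached_manifest:
        if path not in workspace:
            status[path] = "deleted"
    return status


if __name__ == "__main__":
    original = {"a.csv": b"1", "b.csv": b"2"}
    recorded = dir_hash(original)
    # Fresh clone without fetch/pull: only the directory hash is known.
    print(granular_status(recorded, None, {"c.csv": b"3"}))
```

With the manifest available, the same call can classify each file as added, modified, or deleted; without it, every file in the mismatching directory falls back to `unknown`.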

- _DVC uncommitted changes_ indicates that there are changes in the working
directory that are not `dvc commit`-ed yet. Same as _DVC committed changes_,
there might be more detailed state on how each of those files changed.
- _Untracked files_ shows the files that are not being tracked by DVC and Git.
This is disabled by default, unless [`--untracked-files`](#--untracked-files)
is specified.
- _DVC unchanged files_ shows the files that are not changed. This is not shown
by default, unless [`--unchanged`](#--unchanged) is specified.

By default, `dvc data status` does not show individual changes inside the
tracked directories, which can be enabled with [`--granular`](#--granular)
option.
`dvc data status` displays changes in multiple categories:

- `Not in cache` indicates that there are file records (hashes) in `.dvc` or
`dvc.lock` files, but the corresponding <abbr>cache</abbr> files are missing.
This may happen after cloning a DVC repository but before using `dvc pull` (or
`dvc fetch`) to download the data; or after using `dvc gc`.

- `Committed changes` are new, modified, or deleted tracked files or directories
that have been [committed to DVC]. These may be ready for committing to Git.

- `Uncommitted changes` are new, modified, or deleted tracked files or
directories that have not been [committed to DVC] yet. You can `dvc add` or
`dvc commit` these.

- `Untracked files` have not been added to DVC (nor Git). Only shown if the
`--untracked-files` flag is used.

- `Unchanged files` have no modifications. Only shown if the `--unchanged` flag
is used.

Individual changes to files inside [directories tracked as a whole] are not
shown by default but this can be enabled with the `--granular` flag.

[committed to dvc]: /doc/command-reference/commit
[directories tracked as a whole]:
/doc/command-reference/add#adding-entire-directories

## Options

- `--granular` - show granular, file-level information of the changes for
DVC-tracked directories. By default, `dvc data status` does not show
individual changes for files inside the tracked directories.
- `--granular` - show granular file-level changes inside DVC-tracked
directories. Not included by default.

- `--untracked-files` - show files that are not being tracked by DVC and Git.

@@ -83,31 +88,6 @@ option.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

Member

why removing it?

Contributor Author (jorgeorpinel, Sep 6, 2022)

I suggest removing it because it's already used in the description.

You asked me to make a PR with the most important changes from my feedback and IMO this is one of them: #3812 (comment).

Member

#3812 (comment) - the problem is it was discussed already. Question - do you have an actually strong opinion about this? :) (I'm personally fine and Dave was fine it seems)

Contributor Author (jorgeorpinel, Sep 7, 2022)

I didn't realize it had been discussed (I didn't read all the resolved comments). I still think it's unnecessary since the --granular example can still refer to the one in the description.

It's not a very strong opinion but it does go against good practices for cmd ref examples, I think: they should add special value to the doc, not just cover obvious cases. And to avoid redundancy in general.

But anyway, I rolled it back in 725c7e4 since it was discussed by them.


```dvc
$ dvc data status
Not in cache:
(use "dvc fetch <file>..." to download files)
data/data.xml

DVC committed changes:
(git commit the corresponding dvc files to update the repo)
modified: data/features/

DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
(use "dvc checkout <file>..." to discard changes)
deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
```

This shows that `data/data.xml` is missing from the cache; `data/features/`, a
directory, has changes that are tracked by DVC but not yet committed to Git;
and the file `model.pkl` has been deleted from the workspace. The
`data/features/` directory is modified, but there are no further details about
what changed inside. The `--granular` option can provide more information on that.
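
As a rough illustration of how the output is organized (category headers, hint lines in parentheses, then entries with an optional state prefix), here is a minimal Python sketch that groups the sample output by category. The `parse_status` helper is hypothetical, the indentation in `SAMPLE` is an assumption, and the sketch relies only on the layout shown in the example; the human-readable format is not a stable interface.

```python
from typing import Dict, List, Optional, Tuple

# Sample taken from the example output; indentation here is assumed.
SAMPLE = """\
Not in cache:
  (use "dvc fetch <file>..." to download files)
        data/data.xml

DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: data/features/

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
"""

Entry = Tuple[Optional[str], str]  # (state or None, path)


def parse_status(text: str) -> Dict[str, List[Entry]]:
    """Group status entries under their category header."""
    categories: Dict[str, List[Entry]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("("):
            continue  # skip blank lines and parenthesized hints
        if line.endswith(":"):
            current = line[:-1]  # a category header such as "Not in cache:"
            categories[current] = []
        elif current is not None:
            # Entries look like "modified: path" or just "path".
            state, sep, path = line.partition(": ")
            categories[current].append((state, path) if sep else (None, line))
    return categories


if __name__ == "__main__":
    for category, entries in parse_status(SAMPLE).items():
        print(category, entries)
```

Note that the `Not in cache` entry carries no state prefix, so it is stored with `None` as its state.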

## Example: Granular output

Following on from the above example, using `--granular` will show file-level