ref: data status #3812

Merged
merged 15 commits into from
Sep 2, 2022
135 changes: 135 additions & 0 deletions content/docs/command-reference/data/status.md
@@ -0,0 +1,135 @@
# data status

Show changes in the data tracked by DVC in the workspace.

## Synopsis

```usage
usage: dvc data status [-h] [-q | -v]
[--granular] [--unchanged]
[--untracked-files [{no,all}]]
[--json]
Comment on lines +8 to +11
Contributor

May be a better visual grouping (even if it doesn't match the help output) cc @dberenbaum

Suggested change
usage: dvc data status [-h] [-q | -v]
[--granular] [--unchanged]
[--untracked-files [{no,all}]]
[--json]
usage: dvc data status [-h] [-q | -v] [--json] [--granular]
[--unchanged] [--untracked-files [{no,all}]]

Member

doesn't look important to me

Contributor

@jorgeorpinel Sep 6, 2022

It's probably secondary, yes. It's a quality-related best practice for the cmd ref/ usage blocks we started applying recently, see #3345.

```

## Description

The `data status` command displays the state of the working directory and the
changes with respect to the last Git commit (`HEAD`). It shows you what new
changes have been committed to DVC, which haven't been committed, which files
aren't being tracked by DVC and Git, and what files are missing from the
<abbr>cache</abbr>.

The `dvc data status` command only outputs information, it won't modify or
change anything in your working directory. It's a good practice to check the
state of your repository before doing `dvc commit` or `git commit` so that you
don't accidentally commit something you don't mean to.

An example output might look something like follows:
Comment on lines +16 to +27
Contributor

@jorgeorpinel Sep 6, 2022

Suggested change
The `data status` command displays the state of the working directory and the
changes with respect to the last Git commit (`HEAD`). It shows you what new
changes have been committed to DVC, which haven't been committed, which files
aren't being tracked by DVC and Git, and what files are missing from the
<abbr>cache</abbr>.
The `dvc data status` command only outputs information, it won't modify or
change anything in your working directory. It's a good practice to check the
state of your repository before doing `dvc commit` or `git commit` so that you
don't accidentally commit something you don't mean to.
An example output might look something like follows:
Displays the state of the <abbr>workspace</abbr> compared to the last Git commit
(`HEAD`). This includes committed and uncommitted additions, updates, and
deletions of DVC-tracked files. Checking the state of your tracked data is
useful to know what to `dvc add` (or `dvc commit`) and `git commit`. Example:

Member

wha -> what

it was fine before I think, don't see a reason to change this

Contributor

> reason to change this

Minor but not cosmetic: Reduced from 3 paragraph Desc. intro to 1. (We want explanations in refs to stay short.) I also removed some sentences that aren't needed IMO. Applied some other existing patterns (e.g. don't mention the command name at the beginning so it's not repetitive later when you need it in other paragraphs).

Member

Applying existing patterns is fine, everything else feels very cosmetic still + changes the meaning / original intention (which is fine, but we don't have strong enough reason to spend time reviewing this to my mind in this case and debate one more time about intentions in the text, benefits, etc, etc). Please, unless it's super important let's not do this - it takes a lot of time to review it.

Contributor

Agreed. I'm making a point to spend time on these for now just to compile major existing practices (example).


```dvc
$ dvc data status
Not in cache:
(use "dvc fetch <file>..." to download files)
data/data.xml

DVC committed changes:
(git commit the corresponding dvc files to update the repo)
modified: data/features/

DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
(use "dvc checkout <file>..." to discard changes)
deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
```

As shown above, `dvc data status` displays changes in multiple categories:

- _Not in cache_ indicates that hashes for the files are recorded in `dvc.lock`
  and `.dvc` files, but the corresponding cache files are missing.
- _DVC committed changes_ indicates changes that have been `dvc commit`-ed but
  differ from the last Git commit. Each file may carry a more detailed state:
  _added_, _modified_, _deleted_, or _unknown_.
- _DVC uncommitted changes_ indicates changes in the working directory that
  have not been `dvc commit`-ed yet. As with _DVC committed changes_, each file
  may carry a more detailed state.
- _Untracked files_ shows the files that are not being tracked by DVC and Git.
  These are hidden unless [`--untracked-files`](#--untracked-files) is
  specified.
- _DVC Unchanged files_ shows the files that have not changed. These are hidden
  unless [`--unchanged`](#--unchanged) is specified.

By default, `dvc data status` does not show individual changes inside tracked
directories. Use the [`--granular`](#--granular) option to show them.

## Options

- `--granular` - show granular, file-level information about the changes in
  DVC-tracked directories. By default, `dvc data status` does not show
  individual changes for files inside tracked directories.

- `--untracked-files` - show files that are not being tracked by DVC and Git.
Contributor

So if it's tracked by EITHER DVC or Git it will not be included here?


- `--unchanged` - show unchanged DVC-tracked files.

- `--json` - prints the command's output in easily parsable JSON format instead
  of human-readable output.

- `-h`, `--help` - prints the usage/help message and exits.

- `-q`, `--quiet` - do not write anything to standard output.

- `-v`, `--verbose` - displays detailed tracing information.
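
As a rough illustration of consuming the `--json` output from a script, the
sketch below runs the command and counts the entries reported under each
top-level key. The JSON schema is not described in this reference, so the helper
makes no assumption about key names and simply walks whatever structure it
receives. The function names (`run_data_status`, `count_entries`, `summarize`)
and the keys in `SAMPLE` are hypothetical, chosen for this example only.

```python
import json
import subprocess
from typing import Any, Dict


def run_data_status(*flags: str) -> Dict[str, Any]:
    """Run `dvc data status --json` with optional extra flags and parse stdout.

    Assumes `dvc` is on PATH and the current directory is a DVC repository.
    """
    cmd = ["dvc", "data", "status", "--json", *flags]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def count_entries(node: Any) -> int:
    """Count leaf values in a nested structure of dicts and lists."""
    if isinstance(node, dict):
        return sum(count_entries(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_entries(item) for item in node)
    return 1  # any scalar (e.g. a path string) counts as one entry


def summarize(payload: Dict[str, Any]) -> Dict[str, int]:
    """Map each top-level key of the payload to its number of leaf entries."""
    return {key: count_entries(value) for key, value in payload.items()}


# Hypothetical payload; the key names are assumptions, not the documented schema.
SAMPLE = {
    "not_in_cache": ["data/data.xml"],
    "committed": {"modified": ["data/features/"]},
    "uncommitted": {"deleted": ["model.pkl"]},
}

if __name__ == "__main__":
    for category, count in summarize(SAMPLE).items():
        print(f"{category}: {count}")
```

In a real repository, `summarize(run_data_status("--granular"))` would apply the
same counting to the live output. Because the helper never looks up specific
keys, it should keep working if the payload gains or renames categories.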

## Examples

```dvc
$ dvc data status
Not in cache:
(use "dvc fetch <file>..." to download files)
data/data.xml

DVC committed changes:
(git commit the corresponding dvc files to update the repo)
modified: data/features/

DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
(use "dvc checkout <file>..." to discard changes)
deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
```

This shows that `data/data.xml` is missing from the cache; `data/features/`, a
directory, has changes that are tracked by DVC but not yet committed to Git;
and the file `model.pkl` has been deleted from the workspace. The
`data/features/` directory is modified, but there are no further details about
what changed inside. The `--granular` option can provide more information on
that.

## Example: Granular output

Following on from the above example, using `--granular` will show file-level
information for the changes:

```dvc
$ dvc data status --granular
Not in cache:
(use "dvc fetch <file>..." to download files)
data/data.xml

DVC committed changes:
(git commit the corresponding dvc files to update the repo)
added: data/features/foo

DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
(use "dvc checkout <file>..." to discard changes)
deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
```

Now there's more information in _DVC committed changes_ regarding the changes in
`data/features`. The output shows that a new file was added to
`data/features`: `data/features/foo`.
11 changes: 11 additions & 0 deletions content/docs/sidebar.json
@@ -218,6 +218,17 @@
"label": "dag",
"slug": "dag"
},
{
"label": "data",
"slug": "data",
"source": false,
"children": [
{
"label": "data status",
"slug": "status"
}
]
},
{
"label": "destroy",
"slug": "destroy"